+

Sunday, July 24, 2011
ajackson @ skylineinnovations.com


Sunday, July 24, 2011
a tale of rapid prototyping, data warehousing, solar power, an architecture designed for data analysis at “scale” ...and arduinos!
Sunday, July 24, 2011

So here’s what I’d like to talk about: who we are, how we got started, and most importantly,
how we’ve been able to use MongoDB to help us. We’re not a traditional startup -- and while
I know this is a Mongo talk, not a startup talk, I’d like to show how Mongo’s flexible nature
has really helped us as a business, and how Mongo specifically has been a good choice for us
as we build some of our tools. Here are some themes:
Scaling



Sunday, July 24, 2011

Mongo has come to have a pretty strong association with the word “scaling.”

Scaling is a word we throw around a lot, and it almost always means “software performance,
as inputs grow by orders of magnitude.”

But scaling also means performance as the variety of inputs increases. I’d argue that it’s
scaling to go from 10 users to 10,000, and it’s also scaling to go from ten ‘kinds’ of input to
a hundred.

There’s another word for this.
Scaling
                                Flexibility


Sunday, July 24, 2011

Particularly when you scale in the real world, you start to find that it’s complicated and messy
and entropic in ways that software isn’t always equipped to handle. So for us, when we say
“Mongo helps us scale”, we don’t necessarily mean scaling to petabytes of data -- though we’ll
come back to data volumes as well.
Business-first
                        development


Sunday, July 24, 2011

This generally means flexible, lightweight processes. Things that become fixed &
unchangeable quickly become obsolete and sad :’(
When Does
                “Context”
               become “Yak
                 Shaving”?


Sunday, July 24, 2011

When I read or hear about new things, I’m always trying to put them in context. So,
sometimes I put too much context in my talks :( To avoid that, I sometimes go a little too fast
over the context that *is* important. So please stop me to ask questions! Also, the problem
domain here is a little different from what we might be used to, so bear with me as we go into
plumbing & construction.
Preliminaries



Sunday, July 24, 2011
Est. 8/2009
Sunday, July 24, 2011
Project Development
                                 +
                             Technology


Sunday, July 24, 2011
“Project Development”
Sunday, July 24, 2011
finance, develop, and operate
                 renewable energy and efficiency
                   installations, for measurable,
                        guaranteed savings.



Sunday, July 24, 2011
finance, develop, and
                    operate renewable energy
                   and efficiency installations, for
                   measurable, guaranteed savings.



Sunday, July 24, 2011

We’ll pay to put stuff on your roof, and we’ll keep it maximally awesome.
finance, develop, and operate
                    renewable energy and
                  efficiency installations, for
                  measurable, guaranteed savings.



Sunday, July 24, 2011

Right now, this means solar thermal, more efficient lighting retrofits, and maybe HVAC.
finance, develop, and operate
                  renewable energy and efficiency
                  installations, for measurable,
                      guaranteed savings.



Sunday, July 24, 2011

So, here’s the interesting part. Since we put stuff on your roof for free, we need to get that
money back. What we do is charge you for the energy it saved you -- but here’s the twist.
Other companies have done similar things, where they say “we’ll pay for a system/retrofit/
whatever, you’ll agree to pay us an arbitrary number, and we promise you’ll get savings -- but
you won’t actually be able to tell.” That always seemed sketchy to us. So we actually measure
the performance of this stuff, collect the data, and guarantee that you save money.
(not webapps)



Sunday, July 24, 2011
Topics not covered:



Sunday, July 24, 2011
• Why solar thermal?
                        • Why hasn’t anyone else done this before?
                        • Pivots? Iterations?
                        • What’s the market size?
                        • Funding? Capital structures?
                        • Wait, how do you guys make money?

Sunday, July 24, 2011

Oh, right, this isn’t a startup talk. But feel free to ask me these later!
Solar Thermal in Five
                               Minutes
( Mongo next, I promise! )




Sunday, July 24, 2011
Municipal
                           =>
                          Roof
                           =>
                          Tank
                           =>
                        Customer
Sunday, July 24, 2011
Relevant Data to Track



Sunday, July 24, 2011
Temperatures
                        (about a dozen)


Sunday, July 24, 2011
Flow Rates
                        (at least two)


Sunday, July 24, 2011
Parallel data streams
                          (hopefully many)


Sunday, July 24, 2011

e.g., weather data, insolation data. It’d be nice if we didn’t have to collect it all ourselves.
how much data?
                        20 data points @ 4 bytes
                        1 minute intervals
                        at 1000 projects (I wish!)
                        for 10 years
                        80 * 60 * 24 * 365 * 10 * 1000 = 400 GB?
                        ...not much, really, “in the raw”


Sunday, July 24, 2011

unfortunately, we can’t really store it with maximal efficiency, because of things like
timestamps, metadata, etc., but still.
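
For reference, here’s roughly how that back-of-the-envelope number works out (raw sensor payload only; the timestamps and metadata are what blow it up in practice):

# Rough sizing, following the slide: 20 data points x 4 bytes, one sample per
# minute, 1000 projects, 10 years -- raw payload only.
bytes_per_sample = 20 * 4                    # 80 bytes per reading
samples_per_project = 60 * 24 * 365 * 10     # one per minute, for ten years
raw_bytes = bytes_per_sample * samples_per_project * 1000
print(raw_bytes / 1e9)                       # ~420 GB "in the raw"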
Sunday, July 24, 2011

I hope this provides enough context on the business problems we’re trying to solve. It looks
like we’ll need a data pipeline, and we’ll need one fast.

We’ve got data that we’ll need to use to build, monitor, and monetize these energy
technologies. Having worked at other smart grid companies before, I’ve seen some good
data pipelines and some bad data pipelines. I’d like to build a good one. The less stuff I
have to build, the better.
Sunday, July 24, 2011

As I do some research, I find that a lot of these data pipelines have a few well-defined areas
of responsibility.
Acquisition,
                         Storage,
                          Search,
                         Retrieval,
                         Analytics.



Sunday, July 24, 2011

These should be self explanatory. What’s interesting is that not only are most of the end-
users of the system analysts, interested in analyzing, but that most systems seem to be
designed for the other functionality. More importantly, they’re not very well decoupled: by
the time the analysts get to start building tools, the design decisions from the beginning are
inextricable from the systems that came before.
Acquisition,
                         Storage,
                          Search,
                         Retrieval,
                                                }       Designed for these



                         Analytics.            <=           Users are here




Sunday, July 24, 2011

Acquisition,
                         Storage,
                          Search,
                         Retrieval,
                                                }       Designed for these



                         Analytics.             <=     Users are here
                                                Business value is here!




Sunday, July 24, 2011


It’s important to remember that, while you can’t get good analytics without the other stuff,
the analytics is where almost all of the value is! Search & retrieval are approaching “solved”
Sunday, July 24, 2011

So, here’s how I started thinking about things. This is a design diagram from the early days
of the company.
Sunday, July 24, 2011

Easy: Python, no problem. There are some interesting topics here, but they’re not MongoDB
related. I was pretty sure I knew how to build this part, and I was pretty sure I knew what the
data would look like.
Sunday, July 24, 2011

This part was also easy -- e-mail reports, CSVs, maybe some fancy graphs, possibly some
light webapps for internal use. These would be dictated by business goals first, but the
technological questions were straightforward.
Sunday, July 24, 2011

Here was the real question.

What would the use cases look like for an analyst having a good experience? What would
they expect the tools to do?
Now we can think
                        about what the data
                             looks like


Sunday, July 24, 2011

So, let’s think about what this data looks like, how it’s structured, and what it is. Then, after
that, we can look at the best ways to organize it for future usefulness.
Time series?
Time,municipal water in T,solar heated water out T,solar tank bottom taped to side,solar tank top taped to side,array in/out,array in/out,tank room ambient t,array supply temperature,array return
temperature,solar energy sensor,customer flow meter,customer OIML btu meter,solar collector array flow meter,solar collector array OIML btu meter,Cycle Count
Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383,12.1642527206,51.1436001456,6.40476190476,8.9582972583,22.6857033228,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333458
Tue Mar 9 23:02:44 2010,14.958038343,53.764889193,12.1642527206,51.0925345058,6.40476190476,8.85184138407,22.5716100982,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:03:45 2010,15.1145934976,53.6986641192,12.1642527206,50.8692901812,6.40476190476,8.78519002979,22.5673674246,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:04:45 2010,15.2512207824,53.5955190752,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333468
Tue Mar 9 23:05:45 2010,15.3690229715,53.5534492867,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333471
Tue Mar 9 23:06:46 2010,15.5253261193,53.5534492867,12.1642527206,50.8658228816,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:07:46 2010,15.6676270005,53.5534492867,12.1642527206,50.9177829276,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.293277114,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:08:47 2010,15.7915083121,53.4761516976,12.1642527206,50.8398031014,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.1826467404,0.0,0.0,0.0,0.0,0.0,333477
Tue Mar 9 23:09:47 2010,15.9763741003,53.693428918,12.1642527206,50.7859446809,6.40476190476,8.78519002979,22.5461357574,24.0728390462,22.1782915595,0.0,1.0,0.0,0.0,0.0,333581
Tue Mar 9 23:10:47 2010,16.1650984572,54.0547534088,12.1642527206,50.725,6.40476190476,8.78519002979,22.4544906773,24.0728390462,22.1782915595,0.0,0.0,0.0,0.0,0.0,333614




Sunday, July 24, 2011
TIME SERIES
                           DATA


Sunday, July 24, 2011

So what is time series data?
Features, Over Time




Sunday, July 24, 2011

multi-dimensional features. What’s fun in a business like this is that we’re not really sure
what the features we study will be. -- Flexibility callout
Features, Over Time

[diagram: the thing being measured (feature vector, v) plotted against time (t)]


Sunday, July 24, 2011

Sunday, July 24, 2011

A couple of ideas:
sampling rates. “regularity”. “completeness”
analog vs. digital
instantaneous vs. cumulative (tradeoffs)
[figure: a time axis with a known interval marked from tn to tn+1]


Sunday, July 24, 2011

Finding known interesting ranges (definitely the most common)
[figure: a time axis with marked points t, t’, etc.]
Sunday, July 24, 2011

Using features to find interesting ranges.

These two ways to look for things should inform our design decisions.
[figure: a signal y plotted against the time axis (t, t’, etc.)]
Sunday, July 24, 2011

[figure: a signal y over time, with a threshold level y’ marked]
Sunday, July 24, 2011

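
As a rough sketch of the second kind of lookup -- scanning a series for the spans where a feature crosses a threshold y’ -- in plain Python (illustrative only, not our production code):

def ranges_above(series, threshold):
    """Given [(t, y), ...] samples, yield (t_start, t_end) spans where y > threshold."""
    start = prev_t = None
    for t, y in series:
        if y > threshold and start is None:
            start = t                      # entering an interesting range
        elif y <= threshold and start is not None:
            yield (start, prev_t)          # leaving it
            start = None
        prev_t = t
    if start is not None:                  # series ended while still above threshold
        yield (start, prev_t)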
(more complicated stuff
                   can be thought of as
                    transformations...)


Sunday, July 24, 2011

e.g., frequency analysis, wavelets, whatever.
Sunday, July 24, 2011

At this point, I go off and do a bunch of research on existing technologies. I really hate
reinventing the wheel, and we really don’t have the manpower.
Time series specific tools



                        Scientific tools & libraries



                        Traditional data-warehousing approaches



Sunday, July 24, 2011

So, these were some of the options I looked at. I want to quickly point out why I eliminated
the first two classes of tools.
Time series specific tools

                           RRDtool -- Round Robin Database




Sunday, July 24, 2011

There are surprisingly few of these. One of the best is RRDtool. It’s pretty sweet, and
I highly recommend it. Unfortunately, it’s really designed for applications that are highly
regular and already pretty digital -- for instance, sampling latencies or temperatures in a
datacenter. It’s not really good for unreliable sensors, nor is it really designed for long-term
persistence. It also has really high lock-in, with legacy data formats, etc. Don’t get me
wrong, it’s totally rad, but I didn’t think it was for us.
Scientific tools & libraries

                           e.g., PyTables




Sunday, July 24, 2011

Pretty cool, but not many of these were mature & ready for primetime. Some that were, like
PyTables, didn’t really match our business use-case.
Traditional data-warehousing approaches



Sunday, July 24, 2011

That leaves us with the traditional approaches. This represents a pretty well-established
field, but very few of the tools are free, lightweight, and mature.
Enterprise buzzwords
                           (Just google for OLAP)




Sunday, July 24, 2011



But the biggest idea I learned is that most data warehousing revolves around the idea of a
“fact table”. They call it a “multidimensional OLAP cube”, but basically it exists as a totally
denormalized SQL table.
“Measures”
                          and their
                        “Dimensions”


Sunday, July 24, 2011

(or facts)
pretty neat!
Sunday, July 24, 2011
“how elegant!”

Sunday, July 24, 2011
in practice...



Sunday, July 24, 2011
Sunday, July 24, 2011
(from “How to Build OLAP Application Using Mondrian
                                + XMLA + SpagoBI”)
Sunday, July 24, 2011

to which the only acceptable response is:
Sunday, July 24, 2011

ha! Yeah right.
Time series are not relational!
Sunday, July 24, 2011

even extracted features are not inherently relational!

Also: you don’t know what you’re looking for, you don’t know when you’ll find it, you won’t
know when you’ll have to start looking for something different.
Why would you lock yourself into a schema?
We don’t know what
                        we’ll want to know.


Sunday, July 24, 2011

We won’t know what we want to know. Not only are we warehousing time-series of
multidimensional feature vectors, we don’t even know the dimensions we’ll be interested in
yet!
natural fit for
                          documents


Sunday, July 24, 2011

This makes a schema-less database a natural fit for these sorts of things. Think about all the
ALTER TABLE calls I’ve avoided...
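
As a small illustration of what that buys -- a pymongo sketch, not our actual code; the database name and the second install are made up -- fact documents with different sets of measures live side by side in the same collection:

from datetime import datetime
from pymongo import MongoClient

db = MongoClient()["warehouse"]        # hypothetical database name

# Two daily fact documents with different measure sets -- no ALTER TABLE when a
# new sensor or derived value shows up, just another key under "measures".
db.facts_daily.insert_one({
    "_id": {"install": "agni-3501", "timestamp": datetime(2010, 8, 6), "frequency": "daily"},
    "measures": {"Energy Sold": 450087.1, "Gallons Sold": 2260},
})
db.facts_daily.insert_one({
    "_id": {"install": "hypothetical-install", "timestamp": datetime(2011, 7, 1), "frequency": "daily"},
    "measures": {"Energy Sold": 12345.6, "Ambient Humidity": 0.43},  # a measure the first doc never had
})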
"_id" : {
                                "install.name" : "agni-3501",
                                "timestamp" : ISODate("2010-08-06T00:00:00Z"),
                                "frequency" : "daily" },
                        "measures" : {
                                "total-delta" : -85.78773442284201,
                                "Energy Sold" : 450087.1186574721,
                                "Generation" : 57273.159890170136,
                                "consumed-delta" : 12.569841951556597,
                                "lbs-sold" : 18848.4,
                                "Gallons Loop" : 740.5,
                                "Coincident Usage" : 400,
                                "Stored Energy" : 1306699.6439737699,
                                "Gallons Sold" : 2260,
                                "Energy Delivered" : 360069.6949259777,
                                "Total Usage" : -1605086.7261496289,
                                "Stratification" : -4.905050370111111,
                                "gen-delta-roof" : 4.819865854785763,
                                "lbs-loop" : 6520.1025 },
                        "day_of_year" : 218,
                        "day_of_week" : 4,
                        "month" : 8,
                        "week_of_year" : 31,
                        "install" : {
                                "panels" : 32,
                                "name" : "agni-3501",
                                "num_files" : "3744",
                                "heater_efficiency" : 0.8,
                                "storage" : 1612,
                                "install_completed" : ISODate("2010-08-06T00:00:00Z"),
                                "logger_type" : "emerald",
                                "_id" : ObjectId("4d2905536edfdb022f000212"),
                                "polysun_proj" : [
                                        22863.7, 24651.7, 30301.7,
                                        30053.5, 29640.5, 27806.4,
                                        27511, 28563.1, 27840.7,
                                        26470.9, 21718.9, 19145.4 ],
                                "last_seen" : "2011-01-08 05:26:35.352782" },
                        "year" : 2010,
                        "day" : 6
Sunday, July 24, 2011

isn’t this better?
"_id" : {
                                "install.name" : "agni-3501",
                                "timestamp" : ISODate("2010-08-06T00:00:00Z"),
                                "frequency" : "daily" },
                        "measures" : {
                                "total-delta" : -85.78773442284201,
                                "Energy Sold" : 450087.1186574721,
                                "Generation" : 57273.159890170136,
                                "consumed-delta" : 12.569841951556597,
                                "lbs-sold" : 18848.4,
                                "Gallons Loop" : 740.5,
                                "Coincident Usage" : 400,
                                "Stored Energy" : 1306699.6439737699,      “measures”
                                "Gallons Sold" : 2260,
                                "Energy Delivered" : 360069.6949259777,
                                "Total Usage" : -1605086.7261496289,
                                "Stratification" : -4.905050370111111,
                                "gen-delta-roof" : 4.819865854785763,
                                "lbs-loop" : 6520.1025 },
                        "day_of_year" : 218,
                        "day_of_week" : 4,
                        "month" : 8,
                                                                         “dimensions”
                        "week_of_year" : 31,
                        "install" : {
                                "panels" : 32,
                                "name" : "agni-3501",
                                "num_files" : "3744",
                                "heater_efficiency" : 0.8,
                                "storage" : 1612,
                                "install_completed" : ISODate("2010-08-06T00:00:00Z"),
                                "logger_type" : "emerald",
                                "_id" : ObjectId("4d2905536edfdb022f000212"),
                                "polysun_proj" : [
                                        22863.7, 24651.7, 30301.7,
                                        30053.5, 29640.5, 27806.4,
                                        27511, 28563.1, 27840.7,
                                        26470.9, 21718.9, 19145.4 ],
                                "last_seen" : "2011-01-08 05:26:35.352782" },
                                                                                         ...right?
                        "year" : 2010,
                        "day" : 6
Sunday, July 24, 2011

measures & dimensions. This would be a nice, clean division, except that it isn’t. Frequently
we’ll look for measures by other measures -- i.e., each measure serves as a dimension.
...actually, not a good
                                model.


Sunday, July 24, 2011

The line gets pretty blurry, in practice. Multi-dimensional vectors mean every measure
provides another dimension.
Anyway!
"_id" : {
                                "install.name" : "agni-3501",
                                "timestamp" : ISODate("2010-08-06T00:00:00Z"),
                                "frequency" : "daily" },
                        "measures" : {
                                "total-delta" : -85.78773442284201,
                                "Energy Sold" : 450087.1186574721,
                                "Generation" : 57273.159890170136,
                                "consumed-delta" : 12.569841951556597,
                                "lbs-sold" : 18848.4,
                                "Gallons Loop" : 740.5,
                                "Coincident Usage" : 400,
                                "Stored Energy" : 1306699.6439737699,
                                "Gallons Sold" : 2260,
                                "Energy Delivered" : 360069.6949259777,
                                "Total Usage" : -1605086.7261496289,
                                "Stratification" : -4.905050370111111,
                                "gen-delta-roof" : 4.819865854785763,
                                "lbs-loop" : 6520.1025 },
                        "day_of_year" : 218,
                        "day_of_week" : 4,
                        "month" : 8,
                        "week_of_year" : 31,
                        "install" : {
                                "panels" : 32,
                                "name" : "agni-3501",
                                "num_files" : "3744",
                                "heater_efficiency" : 0.8,
                                "storage" : 1612,
                                "install_completed" : ISODate("2010-08-06T00:00:00Z"),
                                "logger_type" : "emerald",
                                "_id" : ObjectId("4d2905536edfdb022f000212"),
                                "polysun_proj" : [
                                        22863.7, 24651.7, 30301.7,
                                        30053.5, 29640.5, 27806.4,
                                        27511, 28563.1, 27840.7,
                                        26470.9, 21718.9, 19145.4 ],
                                "last_seen" : "2011-01-08 05:26:35.352782" },
                        "year" : 2010,
                        "day" : 6
Sunday, July 24, 2011

How do we build these quickly & efficiently?
the goal: good numbers!



Sunday, July 24, 2011

Remember, the goal here is to make it easy for analysts to get comparable numbers, so when
i ask for the delivered energy for one system, compared to the delivered energy from
another, i can just get the time-series data, without having to worry about if sensors
changed, when the network was out, when a logger was replaced with another one, etc.
Sunday, July 24, 2011

So, the OLTP layer serving as our inputs essentially serves up timestamped data as CSV
series. It doesn’t really provide a lot of intelligence, and is basically the raw numbers
from rows
                             to columns


Sunday, July 24, 2011

So, most of what our pipeline does is turn things from rows to columns, in a flexible, useful
way. I’m gonna walk through that process, quickly.
"_id" : {
                                "install.name" : "agni-3501",
                                "timestamp" : ISODate("2010-08-06T00:00:00Z"),
                                "frequency" : "daily" },
                        "measures" : {


                                                                       Let’s just look at one
                                "total-delta" : -85.78773442284201,
                                "Energy Sold" : 450087.1186574721,
                                "Generation" : 57273.159890170136,
                                "consumed-delta" : 12.569841951556597,
                                "lbs-sold" : 18848.4,
                                "Gallons Loop" : 740.5,
                                "Coincident Usage" : 400,
                                "Stored Energy" : 1306699.6439737699,
                                "Gallons Sold" : 2260,
                                "Energy Delivered" : 360069.6949259777,
                                "Total Usage" : -1605086.7261496289,
                                "Stratification" : -4.905050370111111,
                                "gen-delta-roof" : 4.819865854785763,
                                "lbs-loop" : 6520.1025 },
                        "day_of_year" : 218,
                        "day_of_week" : 4,
                        "month" : 8,
                        "week_of_year" : 31,
                        "install" : {
                                "panels" : 32,
                                "name" : "agni-3501",
                                "num_files" : "3744",
                                "heater_efficiency" : 0.8,
                                "storage" : 1612,
                                "install_completed" : ISODate("2010-08-06T00:00:00Z"),
                                "logger_type" : "emerald",
                                "_id" : ObjectId("4d2905536edfdb022f000212"),
                                "polysun_proj" : [
                                        22863.7, 24651.7, 30301.7,
                                        30053.5, 29640.5, 27806.4,
                                        27511, 28563.1, 27840.7,
                                        26470.9, 21718.9, 19145.4 ],
                                "last_seen" : "2011-01-08 05:26:35.352782" },
                        "year" : 2010,
                        "day" : 6
Sunday, July 24, 2011
row-major data
Time,municipal water in T,solar heated water out T,solar tank bottom taped to side,solar tank top taped to side,array in/out,array in/out,tank room ambient t,array supply temperature,array return
temperature,solar energy sensor,customer flow meter,customer OIML btu meter,solar collector array flow meter,solar collector array OIML btu meter,Cycle Count
Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383,12.1642527206,51.1436001456,6.40476190476,8.9582972583,22.6857033228,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333458
Tue Mar 9 23:02:44 2010,14.958038343,53.764889193,12.1642527206,51.0925345058,6.40476190476,8.85184138407,22.5716100982,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:03:45 2010,15.1145934976,53.6986641192,12.1642527206,50.8692901812,6.40476190476,8.78519002979,22.5673674246,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:04:45 2010,15.2512207824,53.5955190752,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333468
Tue Mar 9 23:05:45 2010,15.3690229715,53.5534492867,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333471
Tue Mar 9 23:06:46 2010,15.5253261193,53.5534492867,12.1642527206,50.8658228816,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:07:46 2010,15.6676270005,53.5534492867,12.1642527206,50.9177829276,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.293277114,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:08:47 2010,15.7915083121,53.4761516976,12.1642527206,50.8398031014,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.1826467404,0.0,0.0,0.0,0.0,0.0,333477
Tue Mar 9 23:09:47 2010,15.9763741003,53.693428918,12.1642527206,50.7859446809,6.40476190476,8.78519002979,22.5461357574,24.0728390462,22.1782915595,0.0,1.0,0.0,0.0,0.0,333581
Tue Mar 9 23:10:47 2010,16.1650984572,54.0547534088,12.1642527206,50.725,6.40476190476,8.78519002979,22.4544906773,24.0728390462,22.1782915595,0.0,0.0,0.0,0.0,0.0,333614




Sunday, July 24, 2011
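
Here’s a minimal sketch of that pivot, assuming CSV files shaped like the sample above (a header row of sensor names, then one timestamped row per minute); it’s illustrative rather than our actual ingest code:

import csv
from collections import defaultdict
from datetime import datetime

def columns_from_csv(path):
    """Pivot a row-major logger CSV into {column_name: [(timestamp, value), ...]}.

    Note: real headers can repeat (e.g. "array in/out" appears twice above), so a
    production version would need to disambiguate column names.
    """
    series = defaultdict(list)
    with open(path) as f:
        rows = csv.reader(f)
        header = next(rows)
        for row in rows:
            ts = datetime.strptime(row[0], "%a %b %d %H:%M:%S %Y")
            for name, value in zip(header[1:], row[1:]):
                series[name].append((ts, float(value)))
    return series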
“Functional”
class Mass(BasicMeasure):
    def __init__(self, density, volume):
        ...

        self._result_func = functools.partial(
            lambda data, density, volume: density * volume(data),
            density=density, volume=volume)

    def __call__(self, data):
        return self._result_func(data)




Sunday, July 24, 2011

quasi-functional classes that describe how to calculate a value from data.
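
To make the idea runnable, here’s a self-contained toy version -- BasicMeasure is elided on the slide, so the stub below is an assumption, and the usage at the bottom just plugs in a made-up volume measure over the "customer flow meter" column:

import functools

class BasicMeasure:
    """Minimal stand-in for the elided base class: a callable over a chunk of data."""
    def __call__(self, data):
        return self._result_func(data)

class Mass(BasicMeasure):
    def __init__(self, density, volume):
        # `volume` is itself a measure (another callable), so measures compose.
        self._result_func = functools.partial(
            lambda data, density, volume: density * volume(data),
            density=density, volume=volume)

# Hypothetical usage: water at ~8.34 lbs/gallon, gallons summed from a flow column.
gallons = lambda data: sum(data["customer flow meter"])
water_mass = Mass(density=8.34, volume=gallons)
print(water_mass({"customer flow meter": [0.0, 1.0, 0.0]}))   # -> 8.34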
"_id" : {
                                        "install.name" : "agni-3501",
                                        "timestamp" : ISODate("2010-08-06T00:00:00Z"),
                                        "frequency" : "daily" },
                                "measures" : {
                                        "total-delta" : -85.78773442284201,
                                        "Energy Sold" : 450087.1186574721,
                                        "Generation" : 57273.159890170136,
                                        "consumed-delta" : 12.569841951556597,




                                                        A formula:

                                                      E = ∆t × F
#pseudocode
class LoopEnergy(BasicMeasure):
    def __init__(self, heat_cap, delta, mass):
        ...
        def result_func(data):
            return self.delta(data) * self.mass(data) * self.heat_cap
        self._result_func = result_func

    def __call__(self, data):
        return self._result_func(data)




Sunday, July 24, 2011
Creating a Cube
                        For each install, for each chunk of data:

                            apply all known formulas to get values

                            make some convenience keys (e.g., day_of_year)

                            stuff it in mongo

                         Then, map/reduce to whatever dimensionalities you’re
                         interested in: e.g., downsampling.




Sunday, July 24, 2011

Here’s some pseudocode for how to make a cube of multidimensional data.
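
A hedged pymongo sketch of that recipe (the `measures` dict, the `chunks_for` helper, and the database name are assumptions rather than our actual API, and the `_id` is slightly simplified from the example document):

from pymongo import MongoClient

db = MongoClient()["warehouse"]              # hypothetical database name

def convenience_keys(ts):
    """The extra dimensions stored on each fact: day_of_year, day_of_week, etc."""
    return {"day_of_year": ts.timetuple().tm_yday, "day_of_week": ts.weekday(),
            "week_of_year": ts.isocalendar()[1],
            "month": ts.month, "year": ts.year, "day": ts.day}

def build_cube(installs, measures, chunks_for):
    """For each install, for each chunk of data: apply every known measure,
    add convenience keys, and stuff the resulting fact document into Mongo.
    `measures` maps a name to a callable (e.g. Mass, LoopEnergy); `chunks_for`
    yields (timestamp, data_chunk) pairs for one install."""
    for install in installs:
        for timestamp, chunk in chunks_for(install):
            doc = {"_id": {"install": install["name"], "timestamp": timestamp,
                           "frequency": "daily"},
                   "measures": {name: m(chunk) for name, m in measures.items()},
                   "install": install}
            doc.update(convenience_keys(timestamp))
            # Upsert so re-running the pipeline over the same chunk is idempotent.
            db.facts_daily.replace_one({"_id": doc["_id"]}, doc, upsert=True)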
So, what’s the payoff?
How much water did
                         [x] use, monthly?
> db.facts_monthly.find({"install.name": [foo]}, {"measures.Gallons Sold": 1}).sort({"_id": 1})




Sunday, July 24, 2011

Complicated analytical queries boil down to nearly single-line Mongo queries. Here are some
examples:
What were our highest
                    production days?
> db.facts_daily.find({}, {"measures.Energy Sold": 1}).sort({"measures.Energy Sold": -1})




Sunday, July 24, 2011

How does the distribution of [x]
                 on the weekend compare to its
                  distribution on the weekdays?
                > weekends = db.facts_daily.find({"day_of_week": {$in: [5,6]}})
                > weekdays = db.facts_daily.find({"day_of_week": {$nin: [5,6]}})
                > do stuff




Sunday, July 24, 2011

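The same queries translate almost directly to the Python driver. A rough pymongo equivalent of the three examples above (the database name is made up):

import pymongo
from pymongo import MongoClient

db = MongoClient()["warehouse"]              # hypothetical database name

# Monthly water usage for one install (the projection keeps just the one measure):
monthly_water = db.facts_monthly.find(
    {"install.name": "agni-3501"}, {"measures.Gallons Sold": 1}
).sort("_id", pymongo.ASCENDING)

# Highest-production days, across every install:
best_days = db.facts_daily.find(
    {}, {"measures.Energy Sold": 1}
).sort("measures.Energy Sold", pymongo.DESCENDING)

# Weekend vs. weekday distributions of whatever measure we care about:
weekends = db.facts_daily.find({"day_of_week": {"$in": [5, 6]}})
weekdays = db.facts_daily.find({"day_of_week": {"$nin": [5, 6]}})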
What’s the production of installs north of a certain
                        latitude, with a certain class of panel, on Tuesdays?

                        For hours where the average delivered temperature
                        delta was above [x], what was our generation
                        efficiency?

                        Normalize by number of panels? (map/reduce)

                        Normalize by distance from equinox? (map/reduce)

                        ...etc.



Sunday, July 24, 2011
• Building a cube can be done in parallel
                        • Map/reduce is an easy way to think about
                          transforms.

                        • Not maximally efficient, but parallelizes on
                          commodity hardware.




Sunday, July 24, 2011

Some advantages.
Re: the third bullet -- so what? It’s not a webapp.
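
For example, the daily-to-monthly downsampling mentioned earlier: the slides do it with map/reduce, but the same rollup is a short job in today’s aggregation pipeline (a sketch -- only a couple of the measures are shown, and the database name is assumed):

from pymongo import MongoClient

db = MongoClient()["warehouse"]   # hypothetical database name

db.facts_daily.aggregate([
    {"$group": {
        "_id": {"install": "$install.name", "year": "$year", "month": "$month"},
        "Energy Sold": {"$sum": "$measures.Energy Sold"},
        "Gallons Sold": {"$sum": "$measures.Gallons Sold"},
    }},
    {"$out": "facts_monthly"},    # materialize the monthly cube as its own collection
])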
mongoDB:
                        The future of enterprise
                         business intelligence.
                           (they just don’t know it yet)




Sunday, July 24, 2011

So, here’s my thesis:
document databases are far superior to relational databases for business intelligence cases.
Not only that, but MongoDB and some common sense let you replace multimillion-dollar,
IBM-level enterprise solutions with open-source awesomeness. All this in a rapid, agile way.
Lastly...



Sunday, July 24, 2011
Mongo expands in an
                           organization.


Sunday, July 24, 2011

it’s cool, don’t fight it. Once we started using it for our analytics, we realized there was a lot
of other schema-loose data that we could use it for -- like the definitions of the measures
themselves, or the details about an install, etc., etc.
Final Thoughts



Sunday, July 24, 2011

OK, I want to close up with a few jumping-off points.
“Business Intelligence”
                          no longer requires
                              megabucks


Sunday, July 24, 2011
Flexible tools means
                 business responsiveness
                      should be easy


Sunday, July 24, 2011
“Scaling” doesn’t just
                          mean depth-first.


Sunday, July 24, 2011

businesses grow deep, in the sense of adding more users, but they also grow broad.
Questions?



Sunday, July 24, 2011
Epilogue
                        Quest for Logging Hardware




Sunday, July 24, 2011
This’ll be easy!
This is such an obvious and well-explored problem space, I’m sure we’ll be able to find a solution that matches our needs without breaking the bank!




Sunday, July 24, 2011
Shopping List!
           16 temperature sensors
                4 flow sensors
        maybe some miscellaneous ones
              internet backhaul
           no software/data lock in.




Sunday, July 24, 2011
Conventions
                  FTW!
And since we’ve walked a couple of convention floors and flipped through product catalogs from major industrial supply vendors, I’m sure it’s in here somewhere!




Sunday, July 24, 2011
derp derp
                    “internet”?
        I’m sure there’s a reason why all
        of these loggers have to connect
                    via USB...
                         Pace Scientific XR5:
                              8 analog
                               3 pulse
                              ONE MB
                            no internet?
                               $500?!?



Sunday, July 24, 2011
yay windows?
...and require proprietary (Windows!) software or subscription plans that route my data through their servers

                        (basically all of them!)



Sunday, July 24, 2011
Maybe the gov’t
          can help!
Perhaps there’s some kind of standard that governments require for solar thermal monitoring systems to be eligible for incentives or tax credits.



Sunday, July 24, 2011
Vive la France!
              An obscure standard by the
                   Organisation
                Internationale de
                Métrologie Légale
                   appears! Neat!




Sunday, July 24, 2011
A “Certified”
                  Logger
                 two temperature sensors
                         one pulse
                  no increase in accuracy
                  no data backhaul -- at all
                             ...
                     what’s the price?



Sunday, July 24, 2011
$1,000




Sunday, July 24, 2011
$1,000




Sunday, July 24, 2011
Hmm...
            I can solder, and arduinos are
                     pretty cheap




Sunday, July 24, 2011
It’s on!




Sunday, July 24, 2011
arduino + netbook!
Sunday, July 24, 2011
TL;DR:
                        Existing loggers
                          are terrible.


Sunday, July 24, 2011

Also, existing industries aren’t really ready for rapid prototyping and its destructive effects.
•   http://www.flickr.com/photos/rknight/4358119571/

                        •   http://4.bp.blogspot.com/_8vNzwxlohg0/TJoUWqsF4LI/AAAAAAAABMg/QaUiKwCEZn8/s320/turtles-all-the-way-down.jpg

                        •   http://www.flickr.com/photos/rhk313/3801302914/

                        •   http://www.flickr.com/photos/benny_lin/481411728/

                        •   http://spagobi.blogspot.com/
                            2010_08_01_archive.html

                        •   http://community.qlikview.com/forums/t/37106.aspx


Sunday, July 24, 2011

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 

Time Series Data Storage in MongoDB

  • 2. ajackson @ skylineinnovations.com Sunday, July 24, 2011
  • 3. a tale of rapid prototyping, data warehousing, solar power, an architecture designed for data analysis at “scale” ...and arduinos! Sunday, July 24, 2011 So here’s what i’d like to talk about: Who we are, how we got started, and most importantly, how we’ve been able to use MongoDB to help us. We’re not a traditional startup -- and while i know that this is not a “startups” talk, but a Mongo one, i’d like to show how Mongo’s flexible nature really helped us as a business, and how Mongo specifically has been a good choice for us as we build some of our tools. Here are some themes:
  • 4. Scaling Sunday, July 24, 2011 Mongo has come to have a pretty strong association with the word “scaling.” Scaling is a word we throw around a lot, and it almost always means “software performance, as inputs grow by orders of magnitude.” But scaling also means performance as the variety of inputs increases. I’d argue that it’s scaling to go from 10 users to 10,000, and it’s also scaling to go from ten ‘kinds’ of input to a hundred. There’s another word for this.
  • 5. Scaling Flexibility Sunday, July 24, 2011 Particularly when you scale in the real world, you start to find that it’s complicated and messy and entropic in ways that software isn’t always equipped to handle. So for us, when we say “mongo helps us scale”, we don’t necessarily mean scaling to petabytes of data. We’ll come back to them as well.
  • 6. Business-first development Sunday, July 24, 2011 This generally means flexible, lightweight processes. Things that become fixed & unchangeable quickly become obsolete and sad :’(
  • 7. When Does “Context” become “Yak Shaving”? Sunday, July 24, 2011 When i read new things or hear about new stuff, I’m always trying to put it in context. So, sometimes i put too much context in my talks :( To avoid it, I sometimes go a little too fast over the context that *is* important. So please stop me to ask questions! Also, the problem domain here is a little different than what we might be used to, so bear with me as we go into plumbing & construction.
  • 10. Project Development + Technology Sunday, July 24, 2011
  • 12. finance, develop, and operate renewable energy and efficiency installations, for measurable, guaranteed savings. Sunday, July 24, 2011
  • 13. finance, develop, and operate renewable energy and efficiency installations, for measurable, guaranteed savings. Sunday, July 24, 2011 We’ll pay to put stuff on your roof, and we’ll keep it at its maximally awesome.
  • 14. finance, develop, and operate renewable energy and efficiency installations, for measurable, guaranteed savings. Sunday, July 24, 2011 Right now, this means solar thermal, more efficient lighting retrofits, and maybe HVAC.
  • 15. finance, develop, and operate renewable energy and efficiency installations, for measurable, guaranteed savings. Sunday, July 24, 2011 So, here’s the interesting part. Since we put stuff on your roof for free, we need to get that money back. What we do is, we’ll charge you for the energy that it saved you, but, here’s the twist. Other companies have done similar things, where they say “we’ll pay for a system/retrofit/whatever, and you’ll agree to pay us an arbitrary number, and we say you’ll get savings, but you won’t actually be able to tell, really.” That always seemed sketchy to us. So, we actually measure the performance of this stuff, collect the data, and guarantee that you save money.
  • 18. • Why solar thermal? • Why hasn’t anyone else done this before? • Pivots? Iterations? • What’s the market size? • Funding? Capital structures? • Wait, how do you guys make money? Sunday, July 24, 2011 Oh, right, this isn’t a startup talk. But feel free to ask me these later!
  • 19. Solar Thermal in Five Minutes ( mongo next, i promise! ) Sunday, July 24, 2011
  • 20. Municipal => Roof => Tank => Customer Sunday, July 24, 2011
  • 21. Relevant Data to Track Sunday, July 24, 2011
  • 22. Temperatures (about a dozen) Sunday, July 24, 2011
  • 23. Flow Rates (at least two) Sunday, July 24, 2011
  • 24. Parallel data streams (hopefully many) Sunday, July 24, 2011 e.g., weather data, insolation data. It’d be nice if we didn’t have to collect it all ourselves.
  • 25. how much data? 20 data points @ 4 bytes 1 minute intervals at 1000 projects (I wish!) for 10 years 80 * 60 * 24 * 365 * 10 * 1000 = 400 GB? ...not much, really, “in the raw” Sunday, July 24, 2011 unfortunately, we can’t really store it with maximal efficiency, because of things like timestamps, metadata, etc., but still.
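  To make that arithmetic explicit, here’s the same back-of-envelope estimate as a few lines of Python (the sensor count, sample size, and project count are the assumptions from the slide, not real deployment numbers):

      # raw storage estimate, ignoring timestamps, metadata, and storage overhead
      SENSORS = 20               # data points per reading
      BYTES_PER_VALUE = 4        # one 32-bit float each
      PROJECTS = 1000            # "I wish!"
      YEARS = 10
      MINUTES = 60 * 24 * 365 * YEARS

      raw_bytes = SENSORS * BYTES_PER_VALUE * MINUTES * PROJECTS
      print("%.0f GB" % (raw_bytes / 1e9))   # ~420 GB "in the raw"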
  • 26. Sunday, July 24, 2011 I hope this provides enough context on the business problems we’re trying to solve. It looks like we’ll need a data pipeline, and we’ll need one fast. We’ve got data that we’ll need to use to build, monitor, and monetize these energy technologies. Having worked at other smart grid companies before, I’ve seen some good data pipelines and some bad data pipelines. I’d like to build a good one. The less stuff i have to build, the better.
  • 27. Sunday, July 24, 2011 As i do some research, i find that a lot of these data pipelines have a few well-defined areas of responsibility.
  • 28. Acquisition, Storage, Search, Retrieval, Analytics. Sunday, July 24, 2011 These should be self-explanatory. What’s interesting is that not only are most of the end-users of the system analysts, interested in analyzing, but that most systems seem to be designed for the other functionality. More importantly, they’re not very well decoupled: by the time the analysts get to start building tools, the design decisions from the beginning are inextricable from the systems that came before.
  • 29. Acquisition, Storage, Search, Retrieval, } Designed for these Analytics. <= Users are here Sunday, July 24, 2011 These should be self-explanatory. What’s interesting is that not only are most of the end-users of the system analysts, interested in analyzing, but that most systems seem to be designed for the other functionality. More importantly, they’re not very well decoupled: by the time the analysts get to start building tools, the design decisions from the beginning are inextricable from the systems that came before.
  • 30. Acquisition, Storage, Search, Retrieval, Analytics. Sunday, July 24, 2011 These should be self-explanatory. What’s interesting is that not only are most of the end-users of the system analysts, interested in analyzing, but that most systems seem to be designed for the other functionality. More importantly, they’re not very well decoupled: by the time the analysts get to start building tools, the design decisions from the beginning are inextricable from the systems that came before. It’s important to remember that, while you can’t get good analytics without the other stuff, the analytics is where almost all of the value is! Search & retrieval are approaching “solved”
  • 31. Acquisition, Storage, Search, Retrieval, } Designed for these Analytics. <= Users are here Business value is here! Sunday, July 24, 2011 These should be self-explanatory. What’s interesting is that not only are most of the end-users of the system analysts, interested in analyzing, but that most systems seem to be designed for the other functionality. More importantly, they’re not very well decoupled: by the time the analysts get to start building tools, the design decisions from the beginning are inextricable from the systems that came before. It’s important to remember that, while you can’t get good analytics without the other stuff, the analytics is where almost all of the value is! Search & retrieval are approaching “solved”
  • 32. Sunday, July 24, 2011 so, here’s how i started thinking about things. This is a design diagram from the early days of the company.
  • 33. Sunday, July 24, 2011 easy, python, no problem. There are some interesting topics here, but they’re not mongoDB related. I was pretty sure i knew how to build this part, and i was pretty sure i knew what the data would look like.
  • 34. Sunday, July 24, 2011 This part was also easy -- e-mail reports, csvs, maybe some fancy graphs, possibly some light webapps for internal use. These would be dictated by business goals first, but the technological questions were straightforward.
  • 35. Sunday, July 24, 2011 Here was the real question. What would be some use cases of an analyst having a good experience look like? What would they expect the tools to do?
  • 36. Now we can think about what the data looks like Sunday, July 24, 2011 So, let’s think about what this data looks like, how it’s structured and what it is. Then, after that, we can look at what the best ways to organize it for future usefulness.
  • 37. Time series? Time,municipal water in T,solar heated water out T,solar tank bottom taped to side,solar tank top taped to side,array in/out,array in/out,tank room ambient t,array supply temperature,array return temperature,solar energy sensor,customer flow meter,customer OIML btu meter,solar collector array flow meter,solar collector array OIML btu meter,Cycle Count Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383,12.1642527206,51.1436001456,6.40476190476,8.9582972583,22.6857033228,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333458 Tue Mar 9 23:02:44 2010,14.958038343,53.764889193,12.1642527206,51.0925345058,6.40476190476,8.85184138407,22.5716100982,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462 Tue Mar 9 23:03:45 2010,15.1145934976,53.6986641192,12.1642527206,50.8692901812,6.40476190476,8.78519002979,22.5673674246,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462 Tue Mar 9 23:04:45 2010,15.2512207824,53.5955190752,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333468 Tue Mar 9 23:05:45 2010,15.3690229715,53.5534492867,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333471 Tue Mar 9 23:06:46 2010,15.5253261193,53.5534492867,12.1642527206,50.8658228816,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333472 Tue Mar 9 23:07:46 2010,15.6676270005,53.5534492867,12.1642527206,50.9177829276,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.293277114,0.0,0.0,0.0,0.0,0.0,333472 Tue Mar 9 23:08:47 2010,15.7915083121,53.4761516976,12.1642527206,50.8398031014,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.1826467404,0.0,0.0,0.0,0.0,0.0,333477 Tue Mar 9 23:09:47 2010,15.9763741003,53.693428918,12.1642527206,50.7859446809,6.40476190476,8.78519002979,22.5461357574,24.0728390462,22.1782915595,0.0,1.0,0.0,0.0,0.0,333581 Tue Mar 9 23:10:47 2010,16.1650984572,54.0547534088,12.1642527206,50.725,6.40476190476,8.78519002979,22.4544906773,24.0728390462,22.1782915595,0.0,0.0,0.0,0.0,0.0,333614 Sunday, July 24, 2011
  • 38. TIME SERIES DATA Sunday, July 24, 2011 So what is time series data?
  • 39. Features, Over Time Sunday, July 24, 2011 multi-dimensional features. What’s fun in a business like this is that we’re not really sure what the features we study will be. -- Flexibility callout
  • 40. Features, Over Time Thing (Feature vector, v) Time (t) Sunday, July 24, 2011 multi-dimensional features. What’s fun in a business like this is that we’re not really sure what the features we study will be. -- Flexibility callout
  • 41. Features, Over Time Thing (Feature vector, v) Time (t) Sunday, July 24, 2011 multi-dimensional features. What’s fun in a business like this is that we’re not really sure what the features we study will be. -- Flexibility callout
  • 42. Sunday, July 24, 2011 A couple of ideas: sampling rates. “regularity”. “completeness” analog vs. digital instantaneous vs. cumulative (tradeoffs)
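  As a concrete example of the instantaneous-vs-cumulative tradeoff: a pulse meter like the cycle counter in the CSV above only gives you a monotonically increasing count, and the “instantaneous” rate has to be recovered as the difference between samples. A minimal sketch (the sample values are made up, loosely echoing the Cycle Count column):

      # cumulative counts -> per-interval deltas
      samples = [(0, 333458), (60, 333462), (120, 333462), (180, 333468)]   # (seconds, count)

      deltas = []
      for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
          deltas.append((t1, c1 - c0))    # pulses in the interval ending at t1

      print(deltas)    # [(60, 4), (120, 0), (180, 6)]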
  • 43. tn tn+1 Sunday, July 24, 2011 Finding known interesting ranges (definitely the most common)
  • 44. tn tn+1 Sunday, July 24, 2011 Finding known interesting ranges (definitely the most common)
  • 45. t t’ etc. Sunday, July 24, 2011 Using features to find interesting ranges. These two ways to look for things should inform our design decisions.
  • 46. y t t’ etc. Sunday, July 24, 2011 Using features to find interesting ranges. These two ways to look for things should inform our design decisions.
  • 47. y Thresholds y’ t t’ etc. Sunday, July 24, 2011 Using features to find interesting ranges. These two ways to look for things should inform our design decisions.
  • 48. y Thresholds y’ t t’ etc. Sunday, July 24, 2011 Using features to find interesting ranges. These two ways to look for things should inform our design decisions.
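  A sketch of “using features to find interesting ranges”: walk a series of (t, y) points and pull out the spans where y sits above a threshold. The series here is invented; in practice y would be one of the measures discussed later.

      def ranges_above(series, threshold):
          """Yield (t_start, t_end) spans where y stays above the threshold."""
          start = None
          for t, y in series:
              if y > threshold and start is None:
                  start = t
              elif y <= threshold and start is not None:
                  yield (start, t)
                  start = None
          if start is not None:
              yield (start, series[-1][0])

      series = [(0, 1.0), (1, 5.2), (2, 6.1), (3, 0.4), (4, 7.0)]
      print(list(ranges_above(series, 4.0)))   # [(1, 3), (4, 4)]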
  • 49. (more complicated stuff can be thought of as transformations...) Sunday, July 24, 2011 e.g., frequency analysis, wavelets, whatever.
  • 50. Sunday, July 24, 2011 At this point, I go off and do a bunch of research on existing technologies. I really hate reinventing the wheel, and we really don’t have the manpower.
  • 51. Time series specific tools Scientific tools & libraries Traditional data-warehousing approaches Sunday, July 24, 2011 So, these were some of the options i looked at. I want to quickly point out why i eliminated the first two classes of tools.
  • 52. Time series specific tools RRDtool -- Round Robin Database Sunday, July 24, 2011 There are surprisingly few of these. One of the best is the RRDtool. It’s pretty sweet, and i highly recommend it. Unfortunately, it’s really designed for applications that are highly regular, and that are already pretty digital, for instance, sampling latencies, or temperatures in a datacenter. It’s not really good for unreliable sensors, nor is it really designed for long-term persistence. It also has really high lock-in, with legacy data formats, etc. Don’t get me wrong, it’s totally rad, but i didn’t think it was for us.
  • 53. Scientific tools & libraries e.g., PyTables Sunday, July 24, 2011 Pretty cool, but not many of these were mature & ready for primetime. Some that were, like PyTables, didn’t really match our business use-case.
  • 54. Traditional data-warehousing approaches Sunday, July 24, 2011 So, these were some of the options i looked at. I want to quickly point out why i eliminated the first two classes of tools. [...]. That leaves us with the traditional approaches. This represents a pretty well established field, but very few of the tools are free, lightweight, and mature.
  • 55. Enterprise buzzwords (Just google for OLAP) Sunday, July 24, 2011 But the biggest idea i learned is that most data warehousing revolves around the idea of a “fact table”. They call it a “multidimensional OLAP cube”, but basically it exists as a totally denormalized SQL table.
  • 56. “Measures” and their “Dimensions” Sunday, July 24, 2011 (or facts)
  • 61. (from “How to Build OLAP Application Using Mondrian + XMLA + SpagoBI”) Sunday, July 24, 2011 to which the only acceptable response is:
  • 62. Sunday, July 24, 2011 ha! Yeah right.
  • 63. Time series are not relational! Sunday, July 24, 2011 even extracted features are not inherently relational! Also: you don’t know what you’re looking for, you don’t know when you’ll find it, you won’t know when you’ll have to start looking for something different. Why would you lock yourself into a schema?
  • 64. We don’t know what we’ll want to know. Sunday, July 24, 2011 We won’t know what we want to know. Not only are we warehousing time-series of multidimensional feature vectors, we don’t even know the dimensions we’ll be interested in yet!
  • 65. natural fit for documents Sunday, July 24, 2011 This makes a schema-less database a natural fit for these sorts of things. Think about all the alter-table calls i’ve avoided...
  • 66. "_id" : { "install.name" : "agni-3501", "timestamp" : ISODate("2010-08-06T00:00:00Z"), "frequency" : "daily" }, "measures" : { "total-delta" : -85.78773442284201, "Energy Sold" : 450087.1186574721, "Generation" : 57273.159890170136, "consumed-delta" : 12.569841951556597, "lbs-sold" : 18848.4, "Gallons Loop" : 740.5, "Coincident Usage" : 400, "Stored Energy" : 1306699.6439737699, "Gallons Sold" : 2260, "Energy Delivered" : 360069.6949259777, "Total Usage" : -1605086.7261496289, "Stratification" : -4.905050370111111, "gen-delta-roof" : 4.819865854785763, "lbs-loop" : 6520.1025 }, "day_of_year" : 218, "day_of_week" : 4, "month" : 8, "week_of_year" : 31, "install" : { "panels" : 32, "name" : "agni-3501", "num_files" : "3744", "heater_efficiency" : 0.8, "storage" : 1612, "install_completed" : ISODate("2010-08-06T00:00:00Z"), "logger_type" : "emerald", "_id" : ObjectId("4d2905536edfdb022f000212"), "polysun_proj" : [ 22863.7, 24651.7, 30301.7, 30053.5, 29640.5, 27806.4, 27511, 28563.1, 27840.7, 26470.9, 21718.9, 19145.4 ], "last_seen" : "2011-01-08 05:26:35.352782" }, "year" : 2010, "day" : 6 Sunday, July 24, 2011 isn’t this better?
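  Reflowed and trimmed down, the shape of that document is roughly the following (most measures and install fields elided; this is how it might be assembled in Python before going into Mongo):

      from datetime import datetime

      fact = {
          "_id": {"install.name": "agni-3501",
                  "timestamp": datetime(2010, 8, 6),
                  "frequency": "daily"},
          "measures": {"Energy Sold": 450087.12,           # ...and a dozen more
                       "Gallons Sold": 2260},
          "install": {"name": "agni-3501", "panels": 32},  # per-install dimensions
          "day_of_year": 218, "day_of_week": 4, "month": 8,
          "week_of_year": 31, "year": 2010, "day": 6,
      }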
  • 67. "_id" : { "install.name" : "agni-3501", "timestamp" : ISODate("2010-08-06T00:00:00Z"), "frequency" : "daily" }, "measures" : { "total-delta" : -85.78773442284201, "Energy Sold" : 450087.1186574721, "Generation" : 57273.159890170136, "consumed-delta" : 12.569841951556597, "lbs-sold" : 18848.4, "Gallons Loop" : 740.5, "Coincident Usage" : 400, "Stored Energy" : 1306699.6439737699, “measures” "Gallons Sold" : 2260, "Energy Delivered" : 360069.6949259777, "Total Usage" : -1605086.7261496289, "Stratification" : -4.905050370111111, "gen-delta-roof" : 4.819865854785763, "lbs-loop" : 6520.1025 }, "day_of_year" : 218, "day_of_week" : 4, "month" : 8, “dimensions” "week_of_year" : 31, "install" : { "panels" : 32, "name" : "agni-3501", "num_files" : "3744", "heater_efficiency" : 0.8, "storage" : 1612, "install_completed" : ISODate("2010-08-06T00:00:00Z"), "logger_type" : "emerald", "_id" : ObjectId("4d2905536edfdb022f000212"), "polysun_proj" : [ 22863.7, 24651.7, 30301.7, 30053.5, 29640.5, 27806.4, 27511, 28563.1, 27840.7, 26470.9, 21718.9, 19145.4 ], "last_seen" : "2011-01-08 05:26:35.352782" }, ...right? "year" : 2010, "day" : 6 Sunday, July 24, 2011 measures & dimensions. This would be a nice, clean division, except that it isn’t. Frequently we’ll look for measures by other measures -- i.e., each measure serves as a dimension.
  • 68. ...actually, not a good model. Sunday, July 24, 2011 The line gets pretty blurry, in practice. Multi-dimensional vectors mean every measure provides another dimension. Anyway!
  • 69. "_id" : { "install.name" : "agni-3501", "timestamp" : ISODate("2010-08-06T00:00:00Z"), "frequency" : "daily" }, "measures" : { "total-delta" : -85.78773442284201, "Energy Sold" : 450087.1186574721, "Generation" : 57273.159890170136, "consumed-delta" : 12.569841951556597, "lbs-sold" : 18848.4, "Gallons Loop" : 740.5, "Coincident Usage" : 400, "Stored Energy" : 1306699.6439737699, "Gallons Sold" : 2260, "Energy Delivered" : 360069.6949259777, "Total Usage" : -1605086.7261496289, "Stratification" : -4.905050370111111, "gen-delta-roof" : 4.819865854785763, "lbs-loop" : 6520.1025 }, "day_of_year" : 218, "day_of_week" : 4, "month" : 8, "week_of_year" : 31, "install" : { "panels" : 32, "name" : "agni-3501", "num_files" : "3744", "heater_efficiency" : 0.8, "storage" : 1612, "install_completed" : ISODate("2010-08-06T00:00:00Z"), "logger_type" : "emerald", "_id" : ObjectId("4d2905536edfdb022f000212"), "polysun_proj" : [ 22863.7, 24651.7, 30301.7, 30053.5, 29640.5, 27806.4, 27511, 28563.1, 27840.7, 26470.9, 21718.9, 19145.4 ], "last_seen" : "2011-01-08 05:26:35.352782" }, "year" : 2010, "day" : 6 Sunday, July 24, 2011 How do we build these quickly & efficiently?
  • 70. the goal: good numbers! Sunday, July 24, 2011 Remember, the goal here is to make it easy for analysts to get comparable numbers, so when i ask for the delivered energy for one system, compared to the delivered energy from another, i can just get the time-series data, without having to worry about if sensors changed, when the network was out, when a logger was replaced with another one, etc.
  • 71. Sunday, July 24, 2011 So, the OLTP layer serving as our inputs essentially serves up timestamped data as CSV series. It doesn’t really provide a lot of intelligence, and is basically the raw numbers
  • 72. from rows to columns Sunday, July 24, 2011 So, most of what our pipeline does is turn things from rows to columns, in a flexible, useful way. I’m gonna walk through that process, quickly.
  • 73. "_id" : { "install.name" : "agni-3501", "timestamp" : ISODate("2010-08-06T00:00:00Z"), "frequency" : "daily" }, "measures" : { Let’s just look at one "total-delta" : -85.78773442284201, "Energy Sold" : 450087.1186574721, "Generation" : 57273.159890170136, "consumed-delta" : 12.569841951556597, "lbs-sold" : 18848.4, "Gallons Loop" : 740.5, "Coincident Usage" : 400, "Stored Energy" : 1306699.6439737699, "Gallons Sold" : 2260, "Energy Delivered" : 360069.6949259777, "Total Usage" : -1605086.7261496289, "Stratification" : -4.905050370111111, "gen-delta-roof" : 4.819865854785763, "lbs-loop" : 6520.1025 }, "day_of_year" : 218, "day_of_week" : 4, "month" : 8, "week_of_year" : 31, "install" : { "panels" : 32, "name" : "agni-3501", "num_files" : "3744", "heater_efficiency" : 0.8, "storage" : 1612, "install_completed" : ISODate("2010-08-06T00:00:00Z"), "logger_type" : "emerald", "_id" : ObjectId("4d2905536edfdb022f000212"), "polysun_proj" : [ 22863.7, 24651.7, 30301.7, 30053.5, 29640.5, 27806.4, 27511, 28563.1, 27840.7, 26470.9, 21718.9, 19145.4 ], "last_seen" : "2011-01-08 05:26:35.352782" }, "year" : 2010, "day" : 6 Sunday, July 24, 2011
  • 74. row-major data Time,municipal water in T,solar heated water out T,solar tank bottom taped to side,solar tank top taped to side,array in/out,array in/out,tank room ambient t,array supply temperature,array return temperature,solar energy sensor,customer flow meter,customer OIML btu meter,solar collector array flow meter,solar collector array OIML btu meter,Cycle Count Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383,12.1642527206,51.1436001456,6.40476190476,8.9582972583,22.6857033228,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333458 Tue Mar 9 23:02:44 2010,14.958038343,53.764889193,12.1642527206,51.0925345058,6.40476190476,8.85184138407,22.5716100982,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462 Tue Mar 9 23:03:45 2010,15.1145934976,53.6986641192,12.1642527206,50.8692901812,6.40476190476,8.78519002979,22.5673674246,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462 Tue Mar 9 23:04:45 2010,15.2512207824,53.5955190752,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333468 Tue Mar 9 23:05:45 2010,15.3690229715,53.5534492867,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333471 Tue Mar 9 23:06:46 2010,15.5253261193,53.5534492867,12.1642527206,50.8658228816,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333472 Tue Mar 9 23:07:46 2010,15.6676270005,53.5534492867,12.1642527206,50.9177829276,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.293277114,0.0,0.0,0.0,0.0,0.0,333472 Tue Mar 9 23:08:47 2010,15.7915083121,53.4761516976,12.1642527206,50.8398031014,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.1826467404,0.0,0.0,0.0,0.0,0.0,333477 Tue Mar 9 23:09:47 2010,15.9763741003,53.693428918,12.1642527206,50.7859446809,6.40476190476,8.78519002979,22.5461357574,24.0728390462,22.1782915595,0.0,1.0,0.0,0.0,0.0,333581 Tue Mar 9 23:10:47 2010,16.1650984572,54.0547534088,12.1642527206,50.725,6.40476190476,8.78519002979,22.4544906773,24.0728390462,22.1782915595,0.0,0.0,0.0,0.0,0.0,333614 Sunday, July 24, 2011
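  A minimal sketch of that row-to-column pivot, assuming logger CSVs shaped like the sample above (a header of sensor names, then one timestamped row per minute); the parsing details are simplified:

      import csv
      from collections import defaultdict
      from datetime import datetime

      def columns_from_rows(csv_file):
          """Turn a row-major logger CSV into {sensor_name: [(timestamp, value), ...]}."""
          reader = csv.reader(csv_file)
          header = next(reader)               # "Time", then the sensor names
          series = defaultdict(list)
          for row in reader:
              ts = datetime.strptime(row[0], "%a %b %d %H:%M:%S %Y")
              for name, raw in zip(header[1:], row[1:]):
                  series[name].append((ts, float(raw)))
          return series

      # hypothetical file name
      with open("agni-3501.csv") as f:
          by_sensor = columns_from_rows(f)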
  • 75. “Functional”

      import functools

      class Mass(BasicMeasure):
          def __init__(self, density, volume):
              ...
              self._result_func = functools.partial(
                  lambda data, density, volume: density * volume(data),
                  density=density, volume=volume)

          def __call__(self, data):
              return self._result_func(data)

  Sunday, July 24, 2011 quasi-functional classes that describe how to calculate a value from data.
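  Usage would look something like this; BasicMeasure and the chunk layout are assumed, and the flow-meter series is invented purely for illustration:

      # a chunk of parsed data, keyed by sensor name
      some_chunk = {"customer flow meter": [(1, 0.0), (2, 1.0), (3, 0.0)]}
      gallons = lambda data: sum(v for _, v in data["customer flow meter"])

      lbs_of_water = Mass(density=8.34, volume=gallons)   # water is ~8.34 lbs/gallon
      print(lbs_of_water(some_chunk))                     # 8.34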
  • 76. A formula: E = ∆t × F

      # pseudocode
      class LoopEnergy(BasicMeasure):
          def __init__(self, heat_cap, delta, mass):
              ...
              def result_func(data):
                  return self.delta(data) * self.mass(data) * self.heat_cap
              self._result_func = result_func

          def __call__(self, data):
              return self._result_func(data)

  Sunday, July 24, 2011
  • 77. Creating a Cube
    For each install, for each chunk of data:
      • apply all known formulas to get values
      • make some convenience keys (e.g., day_of_year)
      • stuff it in mongo
    Then, map/reduce to whatever dimensionalities you’re interested in: e.g., downsampling.
  Sunday, July 24, 2011 Here’s some pseudocode for how to make a cube of multidimensional data. So, what’s the payoff?
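  A sketch of that recipe with pymongo, assuming a list of measure objects like the ones above (each carrying a .name), a fetch_raw helper, and a facts_daily collection; those names are assumptions, not the production code:

      from datetime import timedelta
      from pymongo import Connection       # the pymongo API of the era; MongoClient today

      db = Connection()["warehouse"]

      def build_daily_fact(install, day, measures, fetch_raw):
          """Compute every known measure over one day of raw data and upsert the fact."""
          data = fetch_raw(install, day, day + timedelta(days=1))
          fact = {
              "_id": {"install.name": install["name"],
                      "timestamp": day, "frequency": "daily"},
              "measures": dict((m.name, m(data)) for m in measures),
              "install": install,
              # convenience keys for later slicing
              "day_of_year": day.timetuple().tm_yday,
              "day_of_week": day.weekday(),
              "week_of_year": day.isocalendar()[1],
              "month": day.month, "year": day.year, "day": day.day,
          }
          db.facts_daily.save(fact)         # upsert by _id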
  • 78. How much water did [x] use, monthly? > db.facts_monthly.find({"install.name": [foo]}, {"measures.Gallons Sold": 1}).sort({"_id": 1}) Sunday, July 24, 2011 Complicated analytical queries can be boiled down to nearly single line mongo-queries. Here’s some examples:
  • 79. What were our highest production days? > db.facts_daily.find({}, {"measures.Energy Sold": 1}).sort({"measures.Energy Sold": -1}) Sunday, July 24, 2011 Complicated analytical queries can be boiled down to nearly single line mongo-queries. Here’s some examples:
  • 80. How does the distribution of [x] on the weekend compare to its distribution on the weekdays? > weekends = db.facts_daily.find({"day_of_week": {$in: [5,6]}}) > weekdays = db.facts_daily.find({"day_of_week": {$nin: [5,6]}}) > do stuff Sunday, July 24, 2011 Complicated analytical queries can be boiled down to nearly single line mongo-queries. Here’s some examples:
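  The “do stuff” part might look like this from Python, assuming a pymongo handle to the same facts_daily collection; “Gallons Sold” is just the measure from the earlier example:

      def mean_measure(cursor, name):
          values = [doc["measures"][name] for doc in cursor if name in doc["measures"]]
          return sum(values) / float(len(values)) if values else None

      weekends = db.facts_daily.find({"day_of_week": {"$in": [5, 6]}})
      weekdays = db.facts_daily.find({"day_of_week": {"$nin": [5, 6]}})

      print("weekend mean: %s" % mean_measure(weekends, "Gallons Sold"))
      print("weekday mean: %s" % mean_measure(weekdays, "Gallons Sold"))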
  • 81. What’s the production of installs north of a certain latitude, with a certain class of panel, on Tuesdays? For hours where the average delivered temperature delta was above [x], what was our generation efficiency? Normalize by number of panels? (map/reduce) Normalize by distance from equinox? (map/reduce) ...etc. Sunday, July 24, 2011
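  For the “normalize by number of panels” case, a map/reduce pass might look roughly like this with pymongo (the output collection name and field choices are assumptions):

      from bson.code import Code

      mapper = Code("""
          function () {
              emit(this.install.name,
                   {energy: this.measures["Energy Sold"], panels: this.install.panels});
          }
      """)

      reducer = Code("""
          function (key, values) {
              var out = {energy: 0, panels: values[0].panels};
              values.forEach(function (v) { out.energy += v.energy; });
              return out;
          }
      """)

      per_install = db.facts_daily.map_reduce(mapper, reducer, "energy_by_install")
      for doc in per_install.find():
          v = doc["value"]
          print("%s: %.1f per panel" % (doc["_id"], v["energy"] / v["panels"]))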
  • 82. • Building a cube can be done in parallel • Map/reduce is an easy way to think about transforms. • Not maximally efficient, but parallelizes on commodity hardware. Sunday, July 24, 2011 Some advantages. re #3 -- so what? It’s not a webapp.
  • 83. mongoDB: The future of enterprise business intelligence. (they just don’t know it yet) Sunday, July 24, 2011 So, here’s my thesis: document-databases are far superior to relational databases for business intelligence cases. Not only that, but mongoDB and some common sense lets you replace multimillion dollar IBM-level enterprise solutions with open-source awesomeness. All this in a rapid, agile way.
  • 85. Mongo expands in an organization. Sunday, July 24, 2011 it’s cool, don’t fight it. Once we started using it for our analytics, we realized there was a lot of other schema-loose data that we could use it for -- like the definitions of the measures themselves, or the details about an install, etc., etc.
  • 86. Final Thoughts Sunday, July 24, 2011 Ok, i want to close up with a few jumping-off points.
  • 87. “Business Intelligence” no longer requires megabucks Sunday, July 24, 2011
  • 88. Flexible tools means business responsiveness should be easy Sunday, July 24, 2011
  • 89. “Scaling” doesn’t just mean depth-first. Sunday, July 24, 2011 businesses grow deep, in the sense of adding more users, but they also grow broad.
  • 91. Epilogue Quest for Logging Hardware Sunday, July 24, 2011
  • 92. This’ll be easy! This is such an obvious and well explored problem space, i’m sure we’ll be able to find a solution that matches our needs without breaking the bank! Sunday, July 24, 2011
  • 93. Shopping List! 16 temperature sensors 4 flow sensors maybe some miscellaneous ones internet backhaul no software/data lock in. Sunday, July 24, 2011
  • 94. Conventions FTW! And since we’ve walked a couple convention floors and product catalogs from major industrial supply vendors, i’m sure it’s in here somewhere! Sunday, July 24, 2011
  • 95. derp derp “internet”? I’m sure there’s a reason why all of these loggers have to connect via USB... Pace Scientific XR5: 8 analog 3 pulse ONE MB no internet? $500?!? Sunday, July 24, 2011
  • 96. yay windows? ...and require proprietary (windows!) software or subscription plans that route my data through their servers (basically all of them!) Sunday, July 24, 2011
  • 97. Maybe the gov’t can help! Perhaps there’s some kind of standard that the governments require for solar thermal monitoring systems to be eligible for incentives or tax credits. Sunday, July 24, 2011
  • 98. Vive la France! An obscure standard by the Organisation Internationale de Métrologie Légale appears! Neat! Sunday, July 24, 2011
  • 99. A “Certified” Logger two temperature sensors one pulse no increase in accuracy no data backhaul -- at all ... what’s the price? Sunday, July 24, 2011
  • 102. Hmm... I can solder, and arduinos are pretty cheap Sunday, July 24, 2011
  • 104. arduino + netbook! Sunday, July 24, 2011
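  The netbook half of “arduino + netbook” can be as small as a pyserial loop that timestamps whatever the arduino prints and appends it to a CSV; the port name, baud rate, and file name here are assumptions:

      import serial                       # pyserial
      from datetime import datetime

      port = serial.Serial("/dev/ttyUSB0", 9600, timeout=120)

      with open("readings.csv", "a") as log:
          while True:
              line = port.readline().decode("ascii", "ignore").strip()
              if not line:
                  continue                # timed out waiting for a sample
              # e.g. "14.76,53.78,12.16,..." straight off the sensors
              log.write("%s,%s\n" % (datetime.now().ctime(), line))
              log.flush()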
  • 105. TL;DR: Existing loggers are terrible. Sunday, July 24, 2011 Also, existing industries aren’t really ready for rapid prototyping and its destructive effects.
  • 106. http://www.flickr.com/photos/rknight/4358119571/
    • http://4.bp.blogspot.com/_8vNzwxlohg0/TJoUWqsF4LI/AAAAAAAABMg/QaUiKwCEZn8/s320/turtles-all-the-way-down.jpg
    • http://www.flickr.com/photos/rhk313/3801302914/
    • http://www.flickr.com/photos/benny_lin/481411728/
    • http://spagobi.blogspot.com/2010_08_01_archive.html
    • http://community.qlikview.com/forums/t/37106.aspx
  Sunday, July 24, 2011