Smart Grids and Big Data
Integration and Governance
Challenges and Opportunities
For Public Utilities
Smart Grid and Big Data
● There are any number of vendors and
publications stating that the IT departments of
utilities need to invest in Big Data to manage
Smart Grids.
● Saying something does not make it true, even if
you say it very loudly and very often. It just
makes it noisy.
● Let's swap out marketing and hype for logic and
math and separate the signal from the noise.
What is a Smart Grid?
First of all, what is a “Smart Grid”? It can mean
different things to different people in a utility
depending on their perspective:
● Customer
● Distribution
● Transmission
● Generation
● Regulatory
Smart Grid – Customer Perspective
● Smart metering means going from one meter reading
per month to a reading every fifteen minutes, roughly a
3,000-fold increase. For every one million meters, a utility
can expect to process 96 million reads per day.
● Time-Of-Day and Time-Of-Use billing considerations
are now in play
● Meters can communicate to customers but that raises
the question, who should be in charge of managing
load based on this information?
Smart Grid – Distribution Perspective
● Advanced sensing and switching devices at
the distribution feeder level can make
distribution system automation affordable
● Selective load control
● Managing distributed generation and even
islanding
Smart Grid – Transmission Perspective
● Increase stability and control by combining phasor
measurement units (PMUs) and GPS with a Supervisory
Control and Data Acquisition (SCADA) system at a central
control facility
● Flexible AC Transmission Systems (FACTS) are involved in
the derivation of Interconnection Reliability Operating Limits
(IROL) and require expensive monitoring devices, although
they are still less costly than building new lines
● Distributed and autonomous control could be used to
eventually create a “self-healing grid”
Smart Grid – Generation Perspective
● The Unit Commitment Problem (UCP) refers to scheduling power
generators (units) to meet electricity demand (load). Always complex
and critical, the variability of wind farms requires different algorithms.
● Additionally, the move from regulated to deregulated markets means
moving to day-ahead and real-time markets. Day-ahead requires
making a commitment based on a prior-day forecast while real-time
requires adjusting the output per unit hourly with surplus/deficit units
being traded on the Independent System Operator (ISO) market.
Smart Grid – Regulatory Perspective
● Smart Meters are considered (rightly or wrongly) to
be a significant privacy and security risk and there
will likely be a patchwork of federal and state
regulations of varying technical feasibility.
● As emerging sources of energy are developed,
there will be additional regulations.
● As new technologies are developed, there will be
additional regulations.
● Basically, there will be additional regulations.
Smart Grid – IT Perspective
● There are a lot of managers that will drive up the heat and intensity
without providing much clarity.
● There are a lot of vendors that have big ticket items that they say will fix
our problems.
● Let's step back and clearly define the problem so we can identify what
form a solution would take before we start writing checks for vaporware.
● In this next section, we'll come up with a clear problem definition and
come up with an algorithmic approach to the problem. We should at the
very least have a good idea of the Big-O of our proposed solution space.
● Once we have a framework, we can more intelligently choose an
implementation.
Smart Grid – Requirements
Essentially, there has been one fundamental technical change: more devices are reporting more data more frequently.
● What are these “devices” again?
– smart meters
– sensors
– synchrophasors
● What do we mean by “reporting”?
– These are just devices, so each one individually is really only capable of generating a text-based log file. It could be fixed or variable
length, XML or JSON, but it will be text. Also, all of these devices now have an IP address, so we will receive it on the network
somehow. Basically, we will not be getting a hand-drawn picture on microfiche (don't laugh: eGIS does).
● What do we mean by “more”?
– We know that smart meters generate roughly 3,000 times more reading intervals than traditional meters. Their payload is a lot bigger, too. We
also know that more sensors and control devices are in use, but we don't have hard numbers on that. Since the
company will likely grow, let's call it 100,000 or 10^5.
● What do we mean by “data”?
– This type of data is called time series data, which Wikipedia tells us “is a sequence of data points, measured typically at successive
points in time spaced at uniform time intervals.”
● What do we mean by “frequently”?
– For smart meters, that's typically (but not exclusively) 15-minute intervals. Sensors and synchrophasors may report more or less often, but we
are talking about minutes. So we're not talking about days or hours anymore, but at least we aren't talking about seconds or
milliseconds. We'll call it “near-real time”, or a 10^3 increase in speed (60 * 24 = 1440 minutes per day).
Smart Grid – Requirements
Essentially, there has been one fundamental requirement change: provide more frequent and robust analytics.
● What do we mean by “provide”?
– There will need to be both ad-hoc and structured analytics and reporting. It is worth noting that data at scale is often not
amenable to the same types of reports that are used for more modest, enterprise-size data.
● What do we mean by “more frequent”?
– For most use cases, the difference between “advanced” and “standard” is speed, not detail. A generation system is
advanced if it can resolve unit commitment problems for the real-time, rather than daily, market. A transmission system is
considered advanced if it can resolve phase issues before they cause a problem. The engineering problems are well
defined by the laws of physics; we just need to be faster in order to be more reliable, effective and affordable.
● What do we mean by “robust”?
– If we give a generation analyst access to such a deep, broad and fast pool of data, there are multiple algorithms that can
be run against that data to possibly develop new strategies for managing unit commitment in a deregulated market with a
wind-farm base, strategies that could never be tried without that data.
● What do we mean by “analytics”?
– Analytics, or analysis of data, is a process of inspecting, cleaning, transforming, and modeling data with the goal of
discovering useful information, suggesting conclusions, and supporting decision making. The major categories of analytics
that are typically performed on time series data are on the following slide.
Time Series Data – Requirements
● Indexing
– Given a query time series Q, and some similarity/dissimilarity measure D(Q, C), find the most similar time series in database
DB
● Clustering
– Find natural groupings of the time series in database DB under some similarity/dissimilarity measure D(Q, C)
● Classification
– Given an unlabeled time series Q, assign it to one of two or more predefined classes
● Prediction
– Given a time series Q containing n data points, predict the value at time n + 1.
● Summarization
– Given a time series Q containing n data points where n is an extremely large number, create a (possibly graphic)
approximation of Q which retains its essential features but fits on a single page, screen, etc.
● Anomaly Detection
– Given a time series Q, assumed to be normal, and an unannotated time series R, find all sections of R which contain
anomalies or “surprising/interesting/unexpected” occurrences
● Segmentation
– Given a time series Q containing n data points, construct a model Q1
from K piecewise segments (K << n), such that Q1
closely approximates Q
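Two of the tasks above, indexing and segmentation, can be sketched in a few lines. This is a minimal illustration, not a production approach: it assumes Euclidean distance as the measure D(Q, C), a brute-force scan in place of a real index, and piecewise aggregate approximation (segment means) as the K-segment model. The names `euclidean`, `most_similar` and `paa` are hypothetical.

```python
import math

def euclidean(q, c):
    """Dissimilarity D(Q, C) for two equal-length series (an assumed choice of D)."""
    if len(q) != len(c):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

def most_similar(q, db):
    """Indexing by brute force: name of the series in db that is closest to q."""
    return min(db, key=lambda name: euclidean(q, db[name]))

def paa(q, k):
    """Segmentation: approximate q by the mean of each of k equal-width segments."""
    n = len(q)
    if n % k != 0:
        raise ValueError("this sketch assumes k divides n evenly")
    width = n // k
    return [sum(q[i:i + width]) / width for i in range(0, n, width)]

# Hypothetical readings for two meters and one query series.
db = {
    "meter_a": [1.0, 1.0, 2.0, 2.0],
    "meter_b": [5.0, 5.0, 5.0, 5.0],
}
query = [1.0, 1.2, 2.1, 1.9]
nearest = most_similar(query, db)                      # "meter_a"
summary = paa([1.0, 3.0, 2.0, 4.0, 10.0, 12.0], 3)     # [2.0, 3.0, 11.0]
```

The brute-force scan is O(number of series × n) per query, which is exactly why a real deployment would want an index over reduced representations such as the segment means.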
Smart Grid – Requirements
Essentially, there have been some things that will not change.
● Safety
– When you have a lot of data that you never had before and are using it to ask questions you never asked before, you need
to make sure that the answer you get makes sense.
– Be conscientious with customers' Personally Identifiable Information (PII). Write secure code: utilities get hacked.
● Governance
– When there are fundamental changes, some of which are tied to substantial dollar amounts, there is often a push to cut
corners. It is never a good idea to be less careful when there is more at stake, but that is typical.
● Compliance
– This is a regulated industry. We are frequently audited. Architect, design and develop for transparency.
● Service Level Agreements
– The solutions used must be able to be implemented on Silver, Gold and Platinum level projects. Keep in mind that there are
enterprise-grade open source solutions and functionally worthless COTS products.
● Budgeting
– Assume no new staff and no money for training. It's just easier that way.
Smart Grid – Problem Statement
We now have our problem well defined:
Provide analytic capability 10^3 faster based on
10^5 more time series data while adhering to the
current corporate standards of safety, governance
and regulatory compliance.
Smart Grid – Problem Statement
Let's take an overview of where we'll need to make improvements:
● Provide near real-time complex analytical capability to business
– 10^3 improvement
● Based on 10^5 more time series data than currently managed
– 10^5 improvement
● While adhering to the current corporate standards of safety, governance and
regulatory compliance.
– Constant
So we need to store a lot more data in such a way that it can be accessed much
faster. Are there data structures and algorithms that can provide for such an
increase in time series data?
Time Series Data
● Storing time series data means taking a large
number of small files and persisting them using
the time (at a minimum) as the natural order.
● The emphasis is on insert, since there is no
mandatory prerequisite for updating or deleting
time series data
● This data will come from multiple sources so
time (and possibly location) is really the only
metric that they must have in common
Time Series Data
● What would a generic data structure look like for a time series datum, assuming that we need to strip out identifiable
data and that most of the data we can preserve cannot be guaranteed to be present in all device types*?
● To make life a little easier, assume the object needs to be queried in the following manner:
– Range queries which specify value conditions, such as: select ∗ from dataset where sales.December ∈ [1 million, 1.2 million]
– Pattern matching queries which rely on the definition of pattern similarity, such as: given time series q, select r from dataset where
similarity(r, q) > threshold (or distance(r, q) < δ)
– Identity queries which rely on the availability of an equality operator
● In order to rapidly query time series data, there needs to be a clear and relevant sequential index.
● A reasonable key could be composed of metric name : timestamp : random 64-bit integer : geo tag. If you needed to
retrieve a block of data that would provide all of the smart meter readings in a High Consequence Area in Ohio in the
third quarter of 2014, that would certainly be readily available.
● Note that this key also makes a strong input to a hashing function. Since a hash table provides the fastest possible data
retrieval (constant time), it is very important to ensure that the hash is well generated. A bad hash degrades a hash
table to a linked list, and we will get data in O(n) rather than O(1).
* ANSI C12.19 is a specification; not an implementation
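The composite key described above can be sketched as follows. This is an illustration under stated assumptions, not a prescribed format: the field widths, the millisecond epoch timestamp, the metric name `meter.kwh`, the geo tags and the helper names `make_key` and `range_scan` are all hypothetical, and metric names are assumed not to contain a colon. Fixed-width fields make the keys sort by metric and then by time, so a time-range query becomes a contiguous slice of a sorted key list.

```python
import bisect
import secrets
from datetime import datetime, timezone

def make_key(metric, ts, geo_tag, nonce=None):
    """Build 'metric:timestamp:nonce:geo' with fixed-width, sortable fields."""
    if nonce is None:
        nonce = secrets.randbits(64)      # random 64-bit integer to avoid key clashes
    epoch_ms = int(ts.timestamp() * 1000)
    return f"{metric}:{epoch_ms:013d}:{nonce:016x}:{geo_tag}"

def range_scan(sorted_keys, metric, start, end):
    """Return keys for one metric with start <= timestamp < end."""
    lo = f"{metric}:{int(start.timestamp() * 1000):013d}"
    hi = f"{metric}:{int(end.timestamp() * 1000):013d}"
    i = bisect.bisect_left(sorted_keys, lo)
    j = bisect.bisect_left(sorted_keys, hi)
    return sorted_keys[i:j]

utc = timezone.utc
keys = sorted([
    make_key("meter.kwh", datetime(2014, 6, 30, 23, 45, tzinfo=utc), "OH-HCA", nonce=1),
    make_key("meter.kwh", datetime(2014, 7, 1, 0, 0, tzinfo=utc), "OH-HCA", nonce=2),
    make_key("meter.kwh", datetime(2014, 9, 30, 23, 45, tzinfo=utc), "NC", nonce=3),
    make_key("meter.kwh", datetime(2014, 10, 1, 0, 0, tzinfo=utc), "OH-HCA", nonce=4),
])

# Third quarter of 2014, then narrowed to the (hypothetical) Ohio HCA geo tag.
q3 = range_scan(keys, "meter.kwh",
                datetime(2014, 7, 1, tzinfo=utc), datetime(2014, 10, 1, tzinfo=utc))
q3_ohio = [k for k in q3 if k.endswith(":OH-HCA")]
```

Because the geo tag sits at the end of the key, the location filter runs after the time-range slice. Moving the geo tag ahead of the timestamp would make location the primary sort order instead; which is better depends on the dominant query.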
Time Series Data
Time series data lends itself to heavy disk I/O. This is not good news.
● Seek Time
– Measure of the time it takes to position the drive to read or write data on disk. This is about 20ms for HDDs and 0.1ms for SSDs.
For HDDs, this 20ms estimate is variable since the performance of the arm is non-linear and the head
may need to move to multiple locations to read a single file.
● Transfer Rate
– Measure of the rate at which data moves between the controller and the host system (external rate) as
well as between the disk surface and the controller (internal rate). This can be 100MB/sec for HDDs.
As far as disk operations are concerned, it is better to transfer than to seek, and even then it's
helpful to minimize the frequency of the transfers in favor of larger payloads. Since the data is
both sequential and immutable, we can have a reasonable expectation that this can be
optimized.
The seek times for RAM are in nanoseconds (1E-09) rather than milliseconds (1E-03), so it
makes sense to short-circuit any deep conversations about partial stroking and hybrid drives.
Time Series Data – Data Structure
We need a storage system that
● minimizes seek time
● is optimized for index searches
● is optimized for sequential searches
● is optimized for inserts
Time Series Data – Data Structure
● Different database engines use different
database structures including:
– B-trees (or B+ trees)
– Log Structured Merge Trees
– Hash table
– Graphs
Time Series Data – Data Structure
● B+ Trees are balanced trees like B-Trees.
● They differ from B-Trees in that they keep all records in the leaf nodes and maintain a doubly
linked list connecting each leaf node to its siblings, which speeds up
sequential and range scans.
● The primary value of a B+ tree is in storing data for efficient
retrieval in a block-oriented storage context. Most file systems
use this structure and it is used by every major relational
database vendor for their key indexes.
● B+ trees are great for random access
● All B-Trees are typically shallow but wide, so there are a
minimal number of seeks.
Time Series Data – Data Structure
● When inserting a record into a B+ Tree, you need to search the
tree to find the location to insert the record. Depending on whether
or not the target node is full, you may or may not need to split it to
make room for the new record. The doubly linked lists help here.
● Since B-Trees are designed to be wide and shallow, there should
be a minimal number of drive seeks.
– “From a practical point of view, B-trees, therefore, guarantee an access
time of less than 10 ms even for extremely large datasets.”
Dr. Rudolf Bayer, inventor of the B-tree
● B+ Tree inserts are O(log n), which, it can be argued, is the
mathematical lower bound for balanced trees.
● So can we do better?
Time Series Data – Data Structure
● There are two key potential areas of
improvement for B+ Trees that are applicable for
systems that are going to do large quantities of
sequential writes:
– Moving seek time from disk to memory
– Moving from block data to log structured data
● The data structure is called a Log Structured
Merge Tree (LSM) and the storage model is
called Log Structured Storage (LSS)
Time Series Data – Data Structure
LSM Tree
● An LSM-Tree is a hybrid tree model that uses two trees: C0
and C1. C0 is smaller and entirely resident in memory,
whereas C1 is resident on disk. New records are written to
C0 and migrated from C0 to C1 based on a size threshold.
● Insertions now run primarily at RAM rather than HDD speeds,
or 1E-09 rather than 1E-03 seconds. Of course, they are
eventually written to disk, but that is where the LSS comes in.
● Note that many production systems concurrently
write to a commit log on disk and to C0, with the commit log
getting deleted after flushing.
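The write path described above can be sketched as a toy two-level structure. This is a simplified illustration only: the class name `TinyLSM`, the `c0_limit` parameter and the in-memory lists standing in for on-disk runs and the commit log are assumptions, and real engines add compaction, bloom filters and crash recovery.

```python
import bisect

class TinyLSM:
    """Toy LSM tree: C0 is an in-memory dict, C1 is a list of sorted immutable runs."""

    def __init__(self, c0_limit=4):
        self.c0 = {}              # in-memory component (C0)
        self.c0_limit = c0_limit  # size threshold that triggers a flush
        self.runs = []            # stand-in for on-disk sorted runs (C1), newest last
        self.commit_log = []      # stand-in for the on-disk commit log

    def put(self, key, value):
        self.commit_log.append((key, value))  # append-only record for durability
        self.c0[key] = value                   # the insert itself happens in memory
        if len(self.c0) >= self.c0_limit:
            self.flush()

    def flush(self):
        """Write C0 out as one sorted run, then clear C0 and the commit log."""
        self.runs.append(sorted(self.c0.items()))
        self.c0.clear()
        self.commit_log.clear()

    def get(self, key):
        if key in self.c0:
            return self.c0[key]
        for run in reversed(self.runs):        # the newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM(c0_limit=3)
for value, timestamp in enumerate(["t3", "t1", "t2", "t4"]):
    db.put(timestamp, value)
```

After the third insert C0 reaches its threshold and is flushed as a single sorted run, so `t4` is the only record still in memory and in the commit log. Note that no insert ever searches or rewrites data already flushed.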
Time Series Data – Data Structure
LSS
● In a traditional storage system, there needs to be a considerable amount of
overhead for updating and deleting existing members. In a log structured storage
system, this overhead does not exist because a log structured storage system
provides for an append-only sequence of data entries. Unlike a B+ Tree-based
system, you don't find a location for new data, you merely append it to the end.
● Because new records are always added to the end, there is never any need for
searching a tree for insertion, as there is in a B-tree storage structure. This allows for
extremely predictable horizontal scaling.
● Providing concurrency and transactional semantics using Multiversion Concurrency
Control (MVCC) is easier in LSS than in a B-tree since existing data is not modified.
A view of the system at state Q at time A is just as valid as a view of Q at time B.
● Note that a conventional local file system typically uses a block size on the order of 4KB. The
HDFS file system was originally designed with 64MB block sizes but is often
configured for 128MB block sizes.
Time Series Data – Data Structure
Hash Table
– With a good hash that minimizes, or optimally avoids, collisions, a hash table delivers the best case of constant time [O(1)];
no data structure can do better for lookups. With a hash that results in a lot of collisions, the worst case
is linear time [O(n)].
– There are drawbacks to hash tables, particularly with dynamic resizing, so most
database systems tend to go with a data structure that has a logarithmic best and
worst case [O(log n)] for fear of linear time performance.
– However, if you can take sequential data from a database, load it in memory and
access it using a hash table, there are operations that can be performed that are not
just faster, but would be impossible otherwise.
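The gap between the best and worst case is easy to see by counting key comparisons. The sketch below is illustrative only: `ChainedTable`, `bad_hash` and `spread_hash` are hypothetical names, and separate chaining with a fixed number of buckets is an assumed design, chosen because it makes the degradation to a linked list visible.

```python
class ChainedTable:
    """Fixed-size hash table with separate chaining."""

    def __init__(self, buckets, hash_fn):
        self.buckets = [[] for _ in range(buckets)]
        self.hash_fn = hash_fn

    def _chain(self, key):
        return self.buckets[self.hash_fn(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._chain(key)
        for i, (existing, _) in enumerate(chain):
            if existing == key:
                chain[i] = (key, value)   # overwrite an existing entry
                return
        chain.append((key, value))

    def probes(self, key):
        """Number of key comparisons needed to find key (or exhaust its chain)."""
        chain = self._chain(key)
        for count, (existing, _) in enumerate(chain, start=1):
            if existing == key:
                return count
        return len(chain)

def bad_hash(key):
    return 0      # deliberately terrible: every key lands in the same bucket

def spread_hash(key):
    return key    # sequential integer keys spread evenly across buckets

good = ChainedTable(1024, spread_hash)
bad = ChainedTable(1024, bad_hash)
for reading_id in range(1000):
    good.put(reading_id, "reading")
    bad.put(reading_id, "reading")

good_probes = good.probes(999)   # 1 comparison: constant time
bad_probes = bad.probes(999)     # 1000 comparisons: a linear scan of one chain
```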
Time Series Data – Compliance
● Having defined the optimal data structure, what is the required consistency model for the data?
● The CAP Theorem
– Consistency
All clients see the same view of the data, even in the presence of updates
– Availability
All clients can find some replica of the data, even in the presence of failure
– Partition Tolerance
The system property holds even if the system is partitioned.
Now, define your problem set(s) and pick two. The easiest way to identify where a use case falls on the CAP Theorem
is to identify the consistency model you need.
● For time series data from devices, availability and partition tolerance are the key drivers. The data should never be lost
and the system should never be unavailable; the combination of near real-time access and a lack of hard relationships
among rows makes this the logical choice. The data should be partitioned across multiple nodes based on a reasonable hash
in order to avoid the hot-spotting problem that can arise with time series data indexes. This is an AP model.
● By contrast, consider a banking system. If a customer makes a transfer from checking to savings, anyone who
looks at that data must see the same result. This would not be the case if the checking and savings accounts were separately
partitioned, so this is a CA model.
● By the way, relational databases are all CA and CA is the only way to be ACID. NoSQL databases are either CP or AP.
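The hot-spotting problem mentioned above can be illustrated with a small placement experiment. This is a sketch under assumptions: the four-node cluster, the device naming scheme and the functions `by_time` and `by_hash` are hypothetical, and SHA-256 simply stands in for whatever partitioner a real datastore uses.

```python
import hashlib
from collections import Counter

NODES = 4  # assumed cluster size

def by_time(device_id, epoch_minute):
    """Naive placement: the 15-minute reading interval alone picks the node."""
    return (epoch_minute // 15) % NODES

def by_hash(device_id, epoch_minute):
    """Hash placement: a digest of the device id picks the node."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES

# 1,000 hypothetical meters all reporting in the same 15-minute interval.
readings = [(f"meter-{i}", 23402880) for i in range(1000)]

time_load = Counter(by_time(d, t) for d, t in readings)
hash_load = Counter(by_hash(d, t) for d, t in readings)
```

With time-based placement every write in an interval lands on one node while the others sit idle. With hash-based placement the same writes spread across all four nodes, at the cost of a time-range query having to touch every node.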
Time Series Data – Data Structure
To summarize,
● The best data structure for inserting into our persistent storage engine for time
series data would run in O(log n) or logarithmic time.
● B+ Trees and Log Structured Merge Trees are both appropriate, but the LSM
Tree will deliver better performance for inserting time series data.
● Proper configuration of the LSM Tree engine could move a substantial amount
of the operations from disk to memory (1E-09 rather than 1E-03 seconds).
● The best data structure for reading from our persistent storage engine would
run in O(1) or constant time.
● The consistency model will need to be AP.
● Since we need to process 1E05 times more data, moving as much processing as possible
from disk (1E-03 seconds) to memory (1E-09 seconds) will absolutely get us there.
Time Series Data – Processing
To process a request from an analyst, we will need to:
– retrieve the data from the data store
– store the data in RAM
– perform calculations on the data
– return the result
● All in near-real time, which we have identified as between
1 and 15 minutes
● We will need to get bigger and faster, but what does
bigger and faster really mean?
Time Series Data – Processing
● Scale up
– Hardware perspective:
● Adding more CPU to increase computational performance
● Adding RAM to increase query and data caching
● Adding more storage such as SSDs and partitioning various I/O processes to different physical disks
– Database perspective:
● Replication techniques to enable various applications to connect to replica databases and eliminate
computational and I/O constraints on the master
● Clustering configurations to handle failover and availability concerns
● Vertical and horizontal database partitioning to optimize query performance
● Scale Out
– Sharding/Federating
– Horizontal partitioning (separating operational and analytical)
– Hadoop cluster
Time Series Data – Processing
● So how much data are we talking about?
– Ballpark, 10 million devices (smart meters, etc.) will generate 1 billion reads per day.
If each read is 1KB, we've already hit the TB mark in a single day. Let's say that we are talking
somewhere in the TB to PB range within the foreseeable future.
● How do you choose to scale up or scale out? Base it on the size of the data
that needs to be loaded into RAM.
– With transfer rates at 100MB/sec, it'll take 17 minutes to move 100GB of data from
disk to RAM, which breaks the 15 minute mark.
– It would appear unlikely, however, that a single analyst would process 100GB of time
series data at a time, and it's even unlikely that a group of analysts would do so.
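The ballpark figures above are simple enough to check directly. The sketch below just reproduces the arithmetic; the function names are hypothetical, and decimal units (1KB = 1,000 bytes, 1MB = 10^6 bytes) are an assumption made to match the round numbers on the slide.

```python
READS_PER_DAY = 24 * 60 // 15   # one read every 15 minutes = 96 per device

def daily_reads(devices):
    """Total meter reads per day for a fleet of devices."""
    return devices * READS_PER_DAY

def daily_bytes(devices, bytes_per_read=1_000):
    """Raw daily volume, assuming a fixed payload per read."""
    return daily_reads(devices) * bytes_per_read

def transfer_minutes(size_bytes, rate_bytes_per_sec=100 * 10**6):
    """Minutes needed to stream size_bytes from disk at a sustained rate."""
    return size_bytes / rate_bytes_per_sec / 60

reads = daily_reads(10_000_000)                 # 960,000,000: roughly 1 billion
volume_tb = daily_bytes(10_000_000) / 10**12    # 0.96 TB per day
load_time = transfer_minutes(100 * 10**9)       # about 16.7 minutes for 100GB
```

The same functions make it easy to test other job sizes: at 100MB/sec, anything under roughly 90GB fits inside the 15 minute window.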
Time Series Data – Summary
● The goal was to provide 10^5 more time series data to
analytic jobs running 10^3 times faster.
● Storing the time series data would optimally use an LSM
Tree, preferably on a file system that provides for
large, read-only block sizes.
● Reading the time series data would optimally use an in-
memory hash table.
● Estimated job sizes would likely require GB of memory
due to geographic and temporal realities of time series
data from physical devices.
Smart Grid – IT Perspective
So, wait, do we have a Big Data problem?
Big Data – 3Vs
Big Data problems are traditionally defined using one or more of the following metrics:
Volume : Volume = rows / objects / bytes
Volume refers to the size of the data to be processed. Is the size of your enterprise data limited by cost or by potential business
opportunity? What could you do with fast, cheap, predictable data growth?
Big Data is any data that is expensive to manage and hard to extract value from.
Michael Franklin, Director of Algorithms, Machines and People Lab, University of California, Berkeley
Velocity : Velocity = number of rows / bytes per unit time
Velocity refers to the latency of data processing relative to the growing demand for interactivity. How much more responsive could your
business be if jobs that used to run overnight can now be run on demand? On a data set an order of magnitude larger? For less money?
Real-time big data isn't just a process for storing petabytes or exabytes of data in a data warehouse. It's about combining and analyzing data so you can
take the right action, at the right time, in the right place.
Michael Minelli, Big Data, Big Analytics
Variety : Variety = number of columns / dimensions / sources
Variety refers to the diversity of sources, formats, quality and structures. Business value should not depend on a strict data model.
Unstructured data and semi-structured data, used properly, can yield valuable and actionable insights.
...no greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures, and inconsistent
data semantics
Doug Laney, 3-D Data Management: Controlling Data Volume, Velocity and Variety, Gartner 2001
Smart Grid – 3Vs
Volume : Volume = rows / objects / bytes
Although the volume of data that utility companies will face is not in the same league as that of some
data-intensive companies, such as Facebook, the move from monthly meter reads to 15 minute scalar
reads means a 3,000-fold increase in data, so your existing system will not easily manage an
increase of more than three orders of magnitude.
Variety : Variety = number of columns / dimensions / sources
Data from industrial control systems will be joined by unstructured and semi-structured data from
weather forecasting systems, security cameras, maps, drawings, pictures, call center logs, social
media and other web resources. An intelligent use of data munging can provide an integrated data
pool from which to inform planning processes and decision making.
Velocity : Velocity = number of rows / bytes per unit time
New grid instrumentation and sensors generate relatively large amounts of streaming data and
there are substantial benefits to be had if this data can be processed in real time for equipment
reliability monitoring, outage prevention and security monitoring.
Duke Energy – 3Vs
For Duke, in order of priority:
– Velocity
– Variety
– Volume
IT Perspective – Action Alternatives
● So what's next?
● There are essentially three basic strategies that
you can use when confronted with a well
defined problem:
– Do we need to take any action at all?
– Can we take evolutionary steps?
– Do we need to take revolutionary measures?
IT Perspective – No Action
● Not taking any action at all would mean storing
the Time Series Data in a relational database
(Oracle or SQL Server or DB). They use a B+
Tree index and we really want an LSM Tree
index.
● But IBM has a product called IBM Informix
TimeSeries for Meter Data Management.
Wouldn't that work?
● ;)
IT Perspective – Revolutionary Action
● A revolutionary step would be to start throwing
away existing platforms and processes and
jumping into the new with both feet.
● That would only be a good idea if you were a
start-up or if you were experimenting with a
new line of business.
● The meter is the cash register of the utility.
Revolutionary is inappropriate.
IT Perspective – Evolutionary Action
● Evolutionary steps involve taking the smallest
actions that would get the biggest result.
● Change one small thing, see what happens,
repeat.
● It also doesn't hurt if we aren't the first people to
do this since we can learn from other people.
● The natural first move would be to adopt a
different data storage model.
Big Data – Why?
● At scale, it has become apparent that one size
does not fit all when it comes to data
management.
● Different approaches were tried to make RDBMS
work with Big Data (e.g., sharding, denormalizing,
etc.).
● The hacks are typically operational nightmares
and eventually new databases, collectively called
NoSQL, were developed.
Big Data – Why Not?
There are four basic enterprise arguments
against NoSQL that should be addressed:
– No ACID equals No Go
– SQL is mandatory
– NoSQL means NoStandards
– NoSQL is for Startups
No ACID equals No Go
Critique: Mission critical data must be Atomic,
Consistent, Isolated and Durable
Response: Of course it should. Customer billing
information needs Consistency and Availability (CA)
and should therefore be stored in an RDBMS.
Customer behavior patterns from your web site and
call center logs, however, need not be treated in the
same manner. No one in the NoSQL community is
arguing for a replacement of RDBMS, just additional
options.
SQL is mandatory
Critique: Low-level query languages (e.g., CODASYL) have
never found support and SQL is a common language
among business users and developers.
Response: This is a great point and that's why so much
effort has been put into Apache Hive, Cloudera Impala
and DataStax CQL to provide a SQL-like front end to Hadoop and Cassandra.
Unfortunately, NoSQL is a name that has stuck and has drawn a
line in the sand that is actually not there.
NoSQL means NoStandards
Critique: Large enterprises may have thousands of databases. These need accepted
standards.
Response: This critique typically falls apart on its own.
Some of us remember something similar being leveled against the upstart relational
model by the mainframe people. Ask yourself if your enterprise has a consistent set of
standards that apply to all databases across the enterprise. Just for fun, try to find out
how many different ways customer addresses are stored.
Also, how do you define 'database'? Most likely not as 'some entity that contains
valuable business data'. That would then include emails, spreadsheets, documents, call
logs, images, log files (internet and device), and even external sources such as social
media. Also, databases do not spring up fully formed and compliant. They are the result
of architectural design and implementation discipline. If you apply the same discipline to
developing your NoSQL schemas, then you have standards.
NoSQL is for Startups
Critique: Startups can use NoSQL because they are too new to have data structures.
Established companies have established data structures.
Response: This is true to a point. When you are a startup, most, but not all, of your use
cases are edge cases. A startup's billing system needs are probably similar to that of an
established company, but we'll leave that aside. The fact of the matter is that the
business landscape has changed and not all of these changes can be managed by the
traditional corporate OLTP system. The explosion not only of the internet but to an even
greater degree mobile devices has presented opportunities that few industries can
ignore.
The real numbers behind the adoption of NoSQL technologies tell the real story. More
than half of the Fortune 500 companies today have a significant investment in Big Data
and NoSQL related technologies.
Scalability: CAP Theorem
● Consistency
Commits are available across entire distributed system
● Availability
System remains accessible and operational at all times
● Partition Tolerance
Only a total system failure can cause the system to respond incorrectly
Now, define your problem set(s) and pick two. The easiest way to identify where a use case falls on the
CAP Theorem is to identify the consistency model you need.
● CA
Traditional relational databases
● AP
Dynamo-like systems, Cassandra, CouchDB, Voldemort, Riak
● CP
BigTable-like systems, MongoDB, HBase, Memcached, Redis
High-Volume Data Management
Big Data – IT Perspective
● We have now identified that we want to make the
smallest change that has the biggest impact and that
would be moving our underlying data structure for
time series data from a B+ Tree to an LSM Tree.
● We have identified AP as our consistency model for
scalable data.
● We have seen that Amazon's Dynamo data model
most closely fits this model and we have identified
three applications based on this model.
IT Perspective – Apply Invariants
● The final step is to apply our invariants to our solution:
adhering to the current corporate standards of safety,
governance and regulatory compliance.
● All three are open-source projects and all three have an
associated corporate entity: Cassandra has DataStax,
Riak has Basho and CouchDB has Couchbase.
● Only DataStax and Basho offer an enterprise-grade
distribution that would be appropriate for utilities.
Comparison of Riak and Cassandra from
Enterprise Development Framework
● Cut to the chase: Riak is written in Erlang and Cassandra is written in
Java. You can even use Spring with Cassandra.
● One other interesting point is that Cassandra offers tunable
consistency. This means that if you have a use case that needs a
different consistency model than the AP model I described, then use
Cassandra. It will allow you to blur the distinctions between strictly AP
and strictly CP.
● From a development perspective, Cassandra is likely the most
straightforward LSM datastore to integrate into a current framework.
● Erlang? Really?
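The tunable consistency point above comes down to simple arithmetic. With N replicas, a read is guaranteed to see the latest acknowledged write when the number of replicas read (R) plus the number written (W) exceeds N. The sketch below assumes a replication factor of 3, and the function names are hypothetical. It only loosely corresponds to Cassandra's ONE and QUORUM consistency levels.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n: int, w: int, r: int) -> bool:
    """True when every read set overlaps every write set (R + W > N)."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return r + w > n


N = 3  # hypothetical replication factor

# Write to one replica, read from one: fast and available, AP-leaning.
print(reads_see_latest_write(N, w=1, r=1))

# Write to a quorum, read from a quorum: overlapping sets, CP-leaning.
print(reads_see_latest_write(N, w=quorum(N), r=quorum(N)))
```

Choosing R and W per request is what blurs the line between strictly AP and strictly CP.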
Comparison of Riak and Cassandra from
Enterprise Architect Framework
● Between DataStax Enterprise and DataStax OpsCenter, DataStax
can be the single vendor for data (Cassandra), analytics (Hadoop)
and search (Solr).
● DataStax offers enterprise grade auditing and security capabilities.
● DataStax has no single point of failure. Not a failover scenario; no
SPOF. This is a huge difference.
● DataStax offers actual horizontal scalability. Replication and
sharding of a B+ Tree datastore are not truly horizontal.
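One technique Dynamo-style stores use to scale horizontally is consistent hashing: nodes and keys share one hash ring, so adding a node takes over a single arc of keys instead of reshuffling everything. The sketch below is illustrative only. `Ring`, `token` and the node and meter names are hypothetical, and this is not the actual partitioner of Cassandra or any DataStax product.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (SHA-256 used only for spreading)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes=()):
        self._tokens = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        t = token(node)
        bisect.insort(self._tokens, t)
        self._owner[t] = node

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position, wrapping at the end.
        i = bisect.bisect_right(self._tokens, token(key)) % len(self._tokens)
        return self._owner[self._tokens[i]]


keys = [f"meter-{i}" for i in range(1000)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add("node-d")
moved = [k for k in keys if ring.node_for(k) != before[k]]
print(f"{len(moved)} of {len(keys)} keys changed owner")
```

Every key that changes owner lands on the new node, and keys on the other arcs stay where they were. Production systems typically give each node many ring positions to even out the load.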

Smart Grids and Big Data

  • 1. Smart Grids and Big Data Integration and Governance Challenges and Opportunities For Public Utilities
  • 2. Smart Grid and Big Data ● There are any number of vendors and publications stating that the IT departments of utilities need to invest in Big Data to manage Smart Grids. ● Saying something does not make it true, even if you say it very loudly and very often. It just makes it noisy. ● Let's swap out marketing and hype for logic and math and separate the signal from the noise.
  • 3. What is a Smart Grid? First of all, what is a “Smart Grid”? It can mean different things to different people in a utility depending on their perspective: ● Customer ● Distribution ● Transmission ● Generation ● Regulatory
  • 4. Smart Grid – Customer Perspective ● Smart metering means going from one meter reading per month to a reading every fifteen minutes, a 3,000-fold increase. For every one million meters, a utility can expect to process 96 million reads per day. ● Time-Of-Day and Time-Of-Use billing considerations are now in play ● Meters can communicate to customers but that raises the question, who should be in charge of managing load based on this information?
  • 5. Smart Grid – Distribution Perspective ● Advanced sensing and switching devices at the distribution feeder level can make distribution system automation affordable ● Selective load control ● Managing distributed generation and even islanding
  • 6. Smart Grid – Transmission Perspective ● Increase stability and control by combining phasor measurement units (PMUs) and GPS with a Supervisory Control and Data Acquisition (SCADA) system at a central control facility ● Flexible AC Transmission Systems (FACTS) are involved in the derivation of Interconnection Reliability Operating Limits (IROL) and require expensive monitoring devices, although they are still less costly than building new lines ● Distributed and autonomous control could be used to eventually create a “self-healing grid”
  • 7. Smart Grid – Generation Perspective ● The Unit Commitment Problem (UCP) refers to scheduling power generators (units) to meet electricity demand (load). Always complex and critical, the variability of wind farms requires different algorithms. ● Additionally, the move from regulated to deregulated markets means moving to day-ahead and real-time markets. Day-ahead requires making a commitment based on a prior-day forecast while real-time requires adjusting the output per unit hourly with surplus/deficit units being traded on the Independent System Operator (ISO) market.
  • 8. Smart Grid – Regulatory Perspective ● Smart Meters are considered (rightly or wrongly) to be a significant privacy and security risk and there will likely be a patchwork of federal and state regulations of varying technical feasibility. ● As emerging sources of energy are developed, there will be additional regulations. ● As new technologies are developed, there will be additional regulations. ● Basically, there will be additional regulations.
  • 9. Smart Grid – IT Perspective ● There are a lot of managers that will drive up the heat and intensity without providing much clarity. ● There are a lot of vendors that have big ticket items that they say will fix our problems. ● Let's step back and clearly define the problem so we can identify what form a solution would take before we start writing checks for vaporware. ● In this next section, we'll come up with a clear problem definition and come up with an algorithmic approach to the problem. We should at the very least have a good idea of the Big-O of our proposed solution space. ● Once we have a framework, we can more intelligently choose an implementation.
  • 10. Smart Grid – IT Perspective ●
  • 11. Smart Grid – IT Perspective
  • 12. Smart Grid – Requirements Essentially, there has been one fundamental technical change: More devices are reporting more data more frequently ● What are these “devices” again? – smart meters – sensors – synchrophasors ● What do we mean by “reporting”? – These are just devices, so each one individually is really only capable of generating a text based log file. It could be fixed or variable length, xml or json but it will be text. Also, all of these devices now have an IP address so we will receive it on the network somehow. Basically, we will not be getting a hand-drawn picture on microfiche (don't laugh: eGIS does). ● What do we mean by “more”? – We know that smart meters generate 3K times more reading intervals than traditional meters. Their payload is a lot bigger, too. We also know that there are more sensors and control devices that are used, but we don't have hard numbers on that. Since the company will likely grow let's call it 100,000 or 10^5. ● What do we mean by “data”? – This type of data is called time series data, which Wikipedia tells us “is a sequence of data points, measured typically at successive points in time spaced at uniform time intervals.” ● What do we mean by “frequently”? – For smart meters, that's typically (but not exclusively) 15 minute intervals. Sensors and synchrophasors may be more or less but we are talking about minutes. So we're not talking about days or hours anymore but at least we aren't talking about seconds or milliseconds. We'll call it “near-real time” or a 10^3 increase in speed (60 * 24 = 1440).
  • 13. Smart Grid – Requirements Essentially, there has been one fundamental requirement change: provide more frequent and robust analytics ● What do we mean by “provide”? – There will need to be both ad-hoc and structured analytics and reporting. It is worth noting that data at scale is often not amenable to the same types of reports that are used for more modest, enterprise-size data. ● What do we mean by “more frequent”? – For most use cases, the difference between “advanced” and “standard” is speed, not detail. A generation system is advanced if it can resolve unit commitment problems for the real-time, rather than daily, market. A transmission system is considered advanced if it can resolve phase issues before they cause a problem. The engineering problems are well defined by the laws of physics; we just need to be faster in order to be more reliable, effective and affordable. ● What do we mean by “robust”? – If we give a generation analyst access to such a deep, broad and fast pool of data, there are multiple algorithms that can be run against that data to possibly develop new strategies for managing unit commitment in a deregulated market with wind farms, strategies that could never be tried without that data. ● What do we mean by “analytics”? – Analytics, or analysis of data, is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. The major categories of analytics that are typically performed on time series data are on the following slide.
  • 14. Time Series Data – Requirements ● Indexing – Given a query time series Q, and some similarity/dissimilarity measure D(Q, C), find the most similar time series in database DB ● Clustering – Find natural groupings of the time series in database DB under some similarity/dissimilarity measure D(Q, C) ● Classification – Given an unlabeled time series Q, assign it to one of two or more predefined classes ● Prediction – Given a time series Q containing n data points, predict the value at time n + 1. ● Summarization – Given a time series Q containing n data points where n is an extremely large number, create a (possibly graphic) approximation of Q which retains its essential features but fits on a single page, screen, etc. ● Anomaly Detection – Given a time series Q, assumed to be normal, and an unannotated time series R, find all sections of R which contain anomalies or “surprising/interesting/unexpected” occurrences ● Segmentation – Given a time series Q containing n data points, construct a model Q1 from K piecewise segments (K << n), such that Q1 closely approximates Q
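As a rough illustration of the indexing task, here is a minimal sketch of a brute-force similarity search. It assumes Euclidean distance as the dissimilarity measure D(Q, C) and equal-length series; the meter names and readings are made up for the example, and a real system would use an index rather than a linear scan.

```python
import math

def euclidean(q, c):
    """One possible dissimilarity measure D(Q, C) for equal-length series."""
    if len(q) != len(c):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

def most_similar(query, db):
    """Return the (name, series) pair in db with the smallest distance to query."""
    return min(db.items(), key=lambda item: euclidean(query, item[1]))

# Hypothetical database of short time series keyed by device name.
db = {
    "meter_a": [1.0, 2.0, 3.0, 4.0],
    "meter_b": [4.0, 3.0, 2.0, 1.0],
    "meter_c": [1.1, 2.1, 2.9, 4.2],
}
name, series = most_similar([1.0, 2.0, 3.0, 4.1], db)
```

The same distance function could back a simple clustering or classification pass, which is why the choice of D(Q, C) matters more than the scan itself.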
  • 15. Smart Grid – Requirements Essentially, there have been some things that will not change. ● Safety – When you have a lot of data that you never had before and are using it to ask questions you never asked before, you need to make sure that the answer you get makes sense. – Be conscientious with customers' Personally Identifiable Information (PII). Write secure code: utilities get hacked. ● Governance – When there are fundamental changes, some of which are tied to substantial dollar amounts, there is often a push to cut corners. It is never a good idea to be less careful when there is more at stake, but that is typical. ● Compliance – This is a regulated industry. We are frequently audited. Architect, design and develop for transparency. ● Service Level Agreements – The solutions used must be able to be implemented on Silver, Gold and Platinum level projects. Keep in mind that there are enterprise-grade open source solutions and functionally worthless COTS products. ● Budgeting – Assume no new staff and no money for training. It's just easier that way.
  • 16. Smart Grid – Problem Statement We now have our problem well defined: Provide analytic capability 10^3 times faster based on 10^5 times more time series data while adhering to the current corporate standards of safety, governance and regulatory compliance.
  • 17. Smart Grid – Problem Statement Let's take an overview of where we'll need to make improvements: ● Provide near real-time complex analytical capability to business – 10^3 improvement ● Based on 10^5 times more time series data than currently managed – 10^5 improvement ● While adhering to the current corporate standards of safety, governance and regulatory compliance. – Constant So we need to store a lot more data in such a way that it can be accessed much faster. Are there data structures and algorithms that can provide for such an increase in time series data?
  • 18. Time Series Data ● Storing time series data means taking a large number of small files and persisting them using the time (at a minimum) as the natural order. ● The emphasis is on insert, since there is no mandatory prerequisite for updating or deleting time series data ● This data will come from multiple sources so time (and possibly location) is really the only metric that they must have in common
  • 19. Time Series Data ● What would a generic data structure look like for a time series datum assuming that we need to strip out identifiable data and the majority of the data that we can preserve cannot be guaranteed to be present in all device types*? ● To make life a little easier, assume the object needs to be queried in the following manner: – Range queries which specify value conditions, such as: select * from dataset where sales.December ∈ [1 million, 1.2 million] – Pattern matching queries which rely on the definition of pattern similarity, such as: Given time series q, select r from dataset where similarity(r, q) > threshold (or distance(r, q) < δ) – Identity queries which rely on the availability of an equality operator ● In order to rapidly query time series data, there needs to be a clear and relevant sequential-based index ● A reasonable key could be composed of metric name : timestamp : random 64 bit integer : geo tag. If you needed to retrieve a block of data that would provide all of the smart meter readings in a High Consequence Area in Ohio in the third quarter of 2014, that would certainly be readily available. ● Note that this key also represents a strong hashing function. Since a hash table provides the fastest possible data retrieval (constant time), it is very important to ensure that the hash is well generated. A bad hash degrades a hash table to a linked list and we will get data in O(n) rather than O(1). * ANSI C12.19 is a specification, not an implementation
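The composite key described above can be sketched as follows. This is only an illustration: the field order follows the slide (metric name : timestamp : random 64 bit integer : geo tag), but the timestamp format, the hex encoding of the random component, the metric name `meter.kwh` and the geo tag `OH-HCA-17` are all assumptions made for the example. It also assumes metric names contain no colon and that timestamps are timezone-aware.

```python
import secrets
from datetime import datetime, timezone

def make_key(metric, ts, geo_tag, nonce=None):
    """Build 'metric:timestamp:random64:geo' with a fixed-width, sortable timestamp."""
    if nonce is None:
        nonce = secrets.randbits(64)  # random 64-bit component to keep keys unique
    stamp = ts.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{metric}:{stamp}:{nonce:016x}:{geo_tag}"

def in_range(key, metric, start, end):
    """True if key belongs to metric and start <= timestamp < end (same stamp format)."""
    key_metric, stamp, _nonce, _geo = key.split(":", 3)
    return key_metric == metric and start <= stamp < end

# Hypothetical reading taken in the third quarter of 2014.
key = make_key(
    "meter.kwh",
    datetime(2014, 8, 1, 12, 15, tzinfo=timezone.utc),
    "OH-HCA-17",
    nonce=42,
)
is_q3 = in_range(key, "meter.kwh", "20140701T000000Z", "20141001T000000Z")
```

Because the timestamp is fixed-width, plain string comparison orders keys chronologically, which is what makes a quarter-long range scan cheap.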
  • 20. Time Series Data Time series data lends itself to heavy disk I/O. This is not good news. ● Seek Rate – Measure of the time it takes to locate data on disk before it can be read or written. This is 20ms for HDDs and 0.1ms for SSDs. For HDDs, this 20ms estimate is variable since the performance of the arm is non-linear and the head may need to move to multiple locations to read a single file. ● Transfer Rate – Measure of the time it takes to move data between the controller and the host system (external rate) as well as between the disk surface and the controller (internal rate). This can be 100MB/sec for HDDs. As far as disk operations are concerned, it is better to transfer than to seek and even then it's helpful to minimize the frequency of the transfers in favor of larger payloads. Since the data is both sequential and immutable, we can have a reasonable expectation that this can be optimized. The seek times for RAM are in nanoseconds (1E-09) rather than milliseconds (1E-03), so it makes sense to short-circuit any deep conversations about partial stroking and hybrid drives.
  • 21. Time Series Data – Data Structure We need a storage system that ● minimizes seek time ● is optimized for index searches ● is optimized for sequential searches ● is optimized for inserts
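To make "better to transfer than to seek" concrete, here is a back-of-the-envelope cost model using the slide's HDD figures (20ms per seek, 100MB/sec transfer). The model itself is a simplifying assumption (total time = seeks plus bytes over bandwidth), and 1KB is taken as 1,000 bytes.

```python
SEEK_S = 0.020         # HDD seek time in seconds, from the slide
TRANSFER_BPS = 100e6   # HDD transfer rate of 100 MB/sec, from the slide

def read_time(total_bytes, n_seeks):
    """Estimated seconds to read total_bytes using n_seeks separate seeks."""
    return n_seeks * SEEK_S + total_bytes / TRANSFER_BPS

total = 1_000_000 * 1_000  # one million 1 KB readings, about 1 GB

scattered = read_time(total, n_seeks=1_000_000)  # one seek per reading
sequential = read_time(total, n_seeks=1)         # one seek, then a long transfer
```

Under these assumptions the scattered read takes roughly 20,010 seconds (over five hours) while the sequential read takes roughly 10 seconds, which is the whole argument for laying time series data out contiguously.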
  • 22. Time Series Data – Data Structure ● Different database engines use different database structures including: – B-trees (or B+ trees) – Log Structured Merge Trees – Hash table – Graphs
  • 23. Time Series Data – Data Structure ● B+ Trees are balanced trees like B-Trees. ● They are different from B-Trees in that they maintain a doubly linked list to connect each leaf node to its sibling to speed up growth and contraction ● The primary value of a B+ tree is in storing data for efficient retrieval in a block-oriented storage context. Most file systems use this structure and it is used by every major relational database vendor for their key indexes. ● B+ trees are great for random access ● All B-Trees are typically shallow but wide, so there are a minimal number of seeks.
  • 24. Time Series Data – Data Structure ● When inserting a record into a B+-Tree, you need to search the tree to find the location to insert the record. Depending on whether or not the tree is full, you may or may not need to split the tree to make room for the new node. The doubly linked lists help here. ● Since B-Trees are designed to be wide and shallow, there should be a minimal number of drive seeks. – From a practical point of view, B-trees, therefore, guarantee an access time of less than 10 ms even for extremely large datasets. ● Dr. Rudolf Bayer, inventor of the B-tree ● B+ Tree inserts are O(log n), which can be argued is the mathematical lower bound for balanced trees. ● So can we do better?
  • 25. Time Series Data – Data Structure ● There are two key potential areas of improvement for B+ Trees that are applicable for systems that are going to do large quantities of sequential writes: – Moving seek time from disk to memory – Moving from block data to log structured data ● The data structure is called a Log Structured Merge Tree (LSM) and the storage model is called Log Structured Storage (LSS)
  • 26. Time Series Data – Data Structure LSM Tree ● An LSM-Tree is a hybrid tree model that uses two trees: C0 and C1. C0 is smaller and entirely resident in memory, whereas C1 is resident on disk. New records are written to C0 and are merged from C0 into C1 once C0 reaches a size threshold. ● Insertions now run primarily at RAM rather than HDD speeds, or 1E-09 rather than 1E-03 seconds. Of course, they are written to disk, but that is where the LSS comes in. ● Note that many production systems concurrently write to a commit log on disk and to C0, with the commit log getting deleted after flushing.
  • 27. Time Series Data – Data Structure LSS ● In a traditional storage system, there needs to be a considerable amount of overhead for updating and deleting existing members. In a log structured storage system, this overhead does not exist because a log structured storage system provides for an append-only sequence of data entries. Unlike a B+ Tree-based system, you don't find a location for new data, you merely append it to the end. ● Because new records are always added to the end, there is never any need for searching a tree for insertion, like in a B-tree storage structure. This allows for extremely predictable horizontal scaling. ● Providing concurrency and transactional semantics using Multiversion Concurrency Control (MVCC) is easier in LSS than B-tree since existing data is not modified. A view of the system at state Q at time A is just as valid as a view of Q at time B. ● Note that a POSIX-compliant file system typically defines a block size as 64KB. The HDFS file system was originally designed with 64MB block sizes but is often configured for 128MB block sizes.
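The C0/C1 idea can be sketched with a toy in-memory model. This is not how any production engine is implemented: the class name `ToyLSM` is hypothetical, the "disk" component is just a list of sorted runs held in memory, there is no commit log, and runs are appended rather than merged or compacted.

```python
import bisect

class ToyLSM:
    """Toy LSM store: writes go to C0; full C0 is flushed as a sorted run into C1."""

    def __init__(self, threshold=4):
        self.threshold = threshold
        self.c0 = {}   # in-memory component
        self.c1 = []   # stand-in for disk: list of sorted (key, value) runs, newest last

    def put(self, key, value):
        self.c0[key] = value              # inserts touch memory only
        if len(self.c0) >= self.threshold:
            self.flush()

    def flush(self):
        if self.c0:
            self.c1.append(sorted(self.c0.items()))  # one sequential write
            self.c0 = {}

    def get(self, key):
        if key in self.c0:
            return self.c0[key]
        for run in reversed(self.c1):     # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = ToyLSM(threshold=2)
db.put("a", 1)
db.put("b", 2)   # reaches the threshold, so C0 is flushed into C1
db.put("a", 3)   # newer value for "a" now lives in C0
```

Note the trade-off this exposes: writes are cheap and sequential, but a read may have to check C0 and then several runs, which is why real engines add compaction and per-run filters.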
  • 29. Time Series Data – Data Structure Hash Table – With a good hash that minimizes, or optimally avoids, collisions, no data structure can beat a hash table's best-case performance of O(1). With a poor hash that results in a lot of collisions, the worst case is O(n). – There are drawbacks to hash tables, particularly with dynamic resizing, so most database systems tend to go with a data structure that has a logarithmic best and worst case [O(log n)] in fear of linear time performance. – However, if you can take sequential data from a database and load it in memory and access it using a hash table, there are operations that can be performed that are not just faster, but would be impossible otherwise.
  • 30. Time Series Data – Compliance ● Having defined the optimal data structure, what is the required consistency model for the data? ● The CAP Theorem – Consistency All clients see the same view of the data, even in the presence of updates – Availability All clients can find some replica of the data, even in the presence of failure – Partition Tolerance The system property holds even if the system is partitioned. Now, define your problem set(s) and pick two. The easiest way to identify where a use case falls on the CAP Theorem is to identify the consistency model you need. ● For time series data from devices, availability and partition tolerance are key drivers. The data should never be lost and the system should not be unavailable, but a combination of near real-time access and a lack of hard relationships among rows make this the logical choice. The data should be partitioned across multiple nodes based on a reasonable hash in order to avoid the hot-spotting problem that can arise with time series data indexes. This is an AP model. ● For example, consider a banking system. If a customer makes a transfer from checking to savings, anyone who looks at that data must see the same result. This would not be the case if the checking and savings accounts were separately partitioned, so this is a CA model. ● By the way, relational databases are all CA and CA is the only way to be ACID. NoSQL databases are either CP or AP.
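The hot-spotting point can be illustrated with a small sketch of hash partitioning. The use of SHA-256, the key layout and the partition count are assumptions for the example only; the idea is simply that keys sharing the same timestamp prefix still spread across partitions when the partition is chosen by a hash of the whole key rather than by the time prefix.

```python
import hashlib
from collections import Counter

def partition_for(key, n_partitions):
    """Map a key to a partition using a stable hash of the full key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions

# Hypothetical keys: 960 meters all reporting in the same 15-minute interval.
keys = [f"meter.kwh:20140801T121500Z:meter-{i}" for i in range(960)]

# Partitioning on the time prefix would send every one of these writes to a
# single partition; hashing the full key spreads them out.
counts = Counter(partition_for(k, 4) for k in keys)
```

The cost of this choice is that a time-range query now has to fan out to every partition, which is usually an acceptable trade for even write load.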
  • 31. Time Series Data – Compliance
  • 32. Time Series Data – Data Structure To summarize, ● The best data structure for inserting into our persistent storage engine for time series data would run in O(log n) or logarithmic time. ● B+ Trees and Log Structured Merge Trees are both appropriate, but the LSM Tree will deliver better performance for inserting time series data. ● Proper configuration of the LSM Tree engine could move a substantial amount of the operations from disk to memory (1E-09 rather than 1E-03) ● The best data structure for reading from our persistent storage engine would run in O(1) or constant time. ● The consistency model will need to be AP. ● Since we need to process 1E05 times more data, moving as much processing from 1E-03 to 1E-09 will absolutely get us there.
  • 33. Time Series Data – Processing To process a request from an analyst, we will need to: – retrieve the data from the data store – store the data in RAM – perform calculations on the data – return the result ● All in near-real time, which we have identified as between 1 and 15 minutes ● We will need to get bigger and faster, but what does bigger and faster really mean?
  • 34. Time Series Data – Processing ● Scale up – Hardware perspective: ● Adding more CPU to increase computational performance ● Adding RAM to increase query and data caching ● Adding more storage such as SSDs and partitioning various I/O processes to different physical disks – Database perspective: ● Replication techniques to enable various applications to connect to replica databases and eliminate computational and I/O constraints on the master ● Clustering configurations to handle failover and availability concerns ● Vertical and horizontal database partitioning to optimize query performance ● Scale Out – Sharding/Federating – Horizontal partitioning (separating operational and analytical) – Hadoop cluster
  • 35. Time Series Data – Processing ● So how much data are we talking about? – Ballpark, 10 million devices (smart meters, etc) will generate 1 billion reads per day. If each read is 1KB, we've already hit the TB mark. Let's say that we are talking somewhere in the TB to PB range within the foreseeable future. ● How do you choose to scale up or scale out? Base it on the size of the data that needs to be loaded into RAM. – With transfer times at 100MB/sec, it'll take 17 minutes to move 100GB of data from disk to RAM, which breaks the 15 minute mark. – It would appear unlikely, however, that a single analyst would process 100GB of time series data at a time, and it's unlikely that even a group of analysts would do so.
  • 36. Time Series Data – Summary ● The goal was to provide 10^5 times more time series data to analytic jobs running 10^3 times faster. ● Storing the time series data would optimally use an LSM Tree, preferably on a file system that provides for large, read-only block sizes. ● Reading the time series data would optimally use an in-memory hash table. ● Estimated job sizes would likely require GB of memory due to geographic and temporal realities of time series data from physical devices.
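The ballpark figures above can be checked with a few lines of arithmetic. The sketch assumes 96 reads per device per day (one every 15 minutes), 1KB = 1,000 bytes and a sustained 100MB/sec transfer rate; the function names are made up for the example.

```python
def daily_volume_bytes(devices, reads_per_day, bytes_per_read):
    """Raw bytes generated per day by a fleet of devices."""
    return devices * reads_per_day * bytes_per_read

def transfer_minutes(n_bytes, bytes_per_sec=100e6):
    """Minutes needed to move n_bytes from disk to RAM at a sustained rate."""
    return n_bytes / bytes_per_sec / 60

volume = daily_volume_bytes(10_000_000, 96, 1_000)  # 960 billion bytes, about 0.96 TB
minutes = transfer_minutes(100e9)                   # about 16.7 minutes for 100 GB
```

So 10 million devices produce a little under a terabyte per day, and a 100GB working set sits just past the 15 minute limit on a single spindle, which is what makes the working-set size the deciding factor between scaling up and scaling out.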
  • 37. Smart Grid – IT Perspective So, wait, do we have a Big Data problem?
  • 38. Big Data – 3Vs Big Data problems are traditionally defined using one or more of the following metrics: Volume : Volume = rows / objects / bytes Volume refers to the size of the data to be processed. Is the size of your enterprise data limited by cost or by potential business opportunity? What could you do with fast, cheap, predictable data growth? Big Data is any data that is expensive to manage and hard to extract value from. Michael Franklin, Director of Algorithms, Machines and People Lab, University of Berkeley Velocity : Velocity = number of rows / bytes per unit time Velocity refers to the latency of data processing relative to the growing demand for interactivity. How much more responsive could your business be if jobs that used to run overnight can now be run on demand? On a data set an order of magnitude larger? For less money? Real-time big data isn't just a process for storing petabytes or exabytes of data in a data warehouse. It's about combining and analyzing data so you can take the right action, at the right time, in the right place. Michael Minelli, Big Data, Big Analytics Variety : Variety = number of columns / dimensions / sources Variety refers to the diversity of sources, formats, quality and structures. Business value should not depend on a strict data model. Unstructured data and semi-structured data, used properly, can yield valuable and actionable insights. ...no greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics Doug Laney, 3-D Data Management: Controlling Data Volume, Velocity and Variety, Gartner 2001
  • 39. Smart Grid – 3Vs Volume : Volume = rows / objects / bytes The volume of data that utility companies will face is not in the same league as some data-intensive companies, such as Facebook, but the move from monthly meter reads to 15 minute scalar reads means a 3,000-fold increase in data, so your existing system will not easily manage such an order of magnitude increase. Variety : Variety = number of columns / dimensions / sources Data from industrial control systems will be joined by unstructured and semi-structured data from weather forecasting systems, security cameras, maps, drawings, pictures, call center logs, social media and other web resources. An intelligent use of data munging can provide an integrated data pool from which to inform planning processes and decision making. Velocity : Velocity = number of rows / bytes per unit time New grid instrumentation and sensors generate relatively large amounts of streaming data and there are substantial benefits to be had if this data can be processed in real time for equipment reliability monitoring, outage prevention and security monitoring.
  • 40. Duke Energy – 3Vs For Duke, in order of priority: – Velocity – Variety – Volume
  • 41. IT Perspective – Action Alternatives ● So what's next? ● There are essentially three basic strategies that you can use when confronted with a well defined problem: – Do we need to take any action at all? – Can we take evolutionary steps? – Do we need to take revolutionary measures?
  • 42. IT Perspective – No Action ● Not taking any action at all would mean storing the Time Series Data in a relational database (Oracle or SQL Server or DB). They use a B+ Tree index and we really want an LSM Tree index. ● But IBM has a product called IBM Informix TimeSeries for Meter Data Management. Wouldn't that work? ● ;)
  • 43. IT Perspective – Revolutionary Action ● A revolutionary step would be to start throwing away existing platforms and processes and jumping into the new with both feet. ● That would only be a good idea if you were a start-up or if you were experimenting with a new line of business. ● The meter is the cash register of the utility. Revolutionary is inappropriate.
  • 44. IT Perspective – Evolutionary Action ● Evolutionary steps involve taking the smallest actions that would get the biggest result. ● Change one small thing, see what happens, repeat. ● It also doesn't hurt if we aren't the first people to do this since we can learn from other people. ● The natural first move would be to adopt a different data storage model.
  • 45. Big Data – Why? ● At scale, it has become apparent that one size does not fit all when it comes to data management. ● Different approaches were tried to make RDBMS work with Big Data (ex sharding, denormalizing, etc). ● The hacks are typically operational nightmares and eventually new databases, collectively called NoSQL, were developed.
  • 46. Big Data – Why Not? There are four basic enterprise arguments against NoSQL that should be addressed: – No ACID equals No Go – SQL is mandatory – NoSQL means NoStandards – NoSQL is for Startups
  • 47. No ACID equals No Go Critique: Mission critical data must be Atomic, Consistent, Isolated and Durable Response: Of course it should. Customer billing information needs Consistency and Availability (CA) and should therefore be stored in an RDBMS. Customer behavior patterns from your web site and call center logs, however, need not be treated in the same manner. No one in the NoSQL community is arguing for a replacement of RDBMS, just additional options.
  • 48. SQL is mandatory Critique: Low-level query languages (ex CODASYL) have never found support and SQL is a common language among business users and developers. Response: This is a great point and that's why so much effort has been put into Apache Hive and Cloudera Impala and DataStax CQL to provide a SQL front-end to Hadoop. Unfortunately, NoSQL is a name that has stuck and drew a line in the sand that is actually not there.
  • 49. NoSQL means NoStandards Critique: Large enterprises may have thousands of databases. These need accepted standards. Response: This critique typically falls apart on its own. Some of us remember something similar being leveled against the upstart relational model by the mainframe people. Ask yourself if your enterprise has a consistent set of standards that apply to all databases across the enterprise. Just for fun, try to find out how many different ways customer addresses are stored. Also, how do you define 'database'? Most likely not as 'some entity that contains valuable business data'. That would then include emails, spreadsheets, documents, call logs, images, log files (internet and device), and even external sources such as social media. Also, databases do not spring up fully formed and compliant. They are the result of architectural design and implementation discipline. If you apply the same discipline to developing your NoSQL schemas, then you have standards.
  • 50. NoSQL is for Startups Critique: Startups can use NoSQL because they are too new to have data structures. Established companies have established data structures. Response: This is true to a point. When you are a startup, most, but not all, of your use cases are edge cases. A startup's billing system needs are probably similar to that of an established company, but we'll leave that aside. The fact of the matter is that the business landscape has changed and not all of these changes can be managed by the traditional corporate OLTP system. The explosion not only of the internet but to an even greater degree mobile devices has presented opportunities that few industries can ignore. The real numbers behind the adoption of NoSQL technologies tell the real story. More than half of the Fortune 500 companies today have a significant investment in Big Data and NoSQL related technologies.
  • 51. Scalability: CAP Theorem ● Consistency: Commits are visible across the entire distributed system ● Availability: System remains accessible and operational at all times ● Partition Tolerance: Only a total system failure can cause the system to respond incorrectly Now, define your problem set(s) and pick two. The easiest way to identify where a use case falls within the CAP Theorem is to identify the consistency model you need. ● CA: Traditional relational databases ● AP: Dynamo-like systems, Cassandra, CouchDB, Voldemort, Riak ● CP: BigTable-like systems, MongoDB, HBase, Memcached, Redis
  • 53. Big Data – IT Perspective ● We have now identified that we want to make the smallest change that has the biggest impact: moving our underlying data structure for time series data from a B+ Tree to an LSM Tree. ● We have identified AP as our consistency model for scalable data. ● We have seen that Amazon's Dynamo data model most closely fits this model, and we have identified three applications based on it.
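To make the B+ Tree versus LSM Tree distinction concrete, here is a minimal sketch of the LSM idea, assuming nothing about how Cassandra, Riak, or any other product implements it. The class name `TinyLSM`, the `memtable_limit` parameter, and the `meter|time` key format are all hypothetical. Writes land in an in-memory table and are flushed as immutable sorted runs, so ingesting a steady stream of meter reads never updates an on-disk structure in place.

```python
import bisect


class TinyLSM:
    """Toy LSM store (illustrative only): memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}  # most recent writes, held in memory
        self.runs = []      # flushed sorted runs, oldest first

    def put(self, key, value):
        # Writes only touch memory; a full memtable is flushed as one sorted run.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sort once and append; existing runs are never modified.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge all runs into one, keeping the newest value for each key.
        merged = {}
        for run in self.runs:  # oldest first, so later runs overwrite earlier ones
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


store = TinyLSM(memtable_limit=2)
store.put("meter-1|00:00", 1.0)
store.put("meter-1|00:15", 1.5)   # second write triggers a flush
store.put("meter-1|00:00", 2.0)   # a corrected read shadows the flushed value
print(store.get("meter-1|00:00"))
```

The trade-off in this sketch is that reads may have to consult several runs until `compact()` merges them, which is the price paid for cheap sequential writes.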
  • 54. IT Perspective – Apply Invariants ● The final step is to apply our invariants to our solution: adhering to the current corporate standards of safety, governance and regulatory compliance. ● All three are open-source projects and all three have an associated corporate entity: Cassandra has DataStax, Riak has Basho, and CouchDB has Couchbase. ● Only DataStax and Basho offer an enterprise-grade distribution that would be appropriate for utilities.
  • 55. Comparison of Riak and Cassandra from Enterprise Development Framework ● Cut to the chase: Riak is written in Erlang and Cassandra is written in Java. You can even use Spring with Cassandra. ● One other interesting point is that Cassandra offers tunable consistency. This means that if you have a use case that needs a different consistency model than the AP model I described, then use Cassandra. It will allow you to blur the distinctions between strictly AP and strictly CP. ● From a development perspective, Cassandra is likely the most straightforward LSM datastore to integrate into a current framework. ● Erlang? Really?
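The tunable consistency point can be illustrated with the usual quorum arithmetic: with N replicas, a read that contacts R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > N. The sketch below models only that arithmetic. The level names ONE, QUORUM and ALL mirror Cassandra's consistency levels, but the function names are hypothetical and this is not the Cassandra driver API.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must respond at a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority of replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def reads_see_latest_write(write_level, read_level, replication_factor):
    """True when the read and write replica sets must overlap (R + W > N)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


# With three replicas, compare a fast AP-leaning setting with a quorum setting.
for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    overlap = reads_see_latest_write(write_level, read_level, 3)
    print(write_level, read_level, overlap)
```

Choosing ONE for both reads and writes favors availability and latency, while QUORUM for both trades some of that for reads that always reflect the latest acknowledged write. This is the sense in which the AP and CP distinction can be blurred per operation.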
  • 56. Comparison of Riak and Cassandra from Enterprise Architect Framework ● Between DataStax Enterprise and DataStax OpsCenter, DataStax can be the single vendor for data (Cassandra), analytics (Hadoop) and search (Solr). ● DataStax offers enterprise-grade auditing and security capabilities. ● DataStax has no single point of failure. Not a failover scenario; no SPOF. This is a huge difference. ● DataStax offers actual horizontal scalability. Replication and sharding of a B+ Tree datastore are not truly horizontal.