Smart Grids and Big Data
Integration and Governance
Challenges and Opportunities
For Public Utilities

Smart Grid and Big Data
● There are any number of vendors and
publications stating that the IT departments of
utilities need to invest in Big Data to manage
Smart Grids.
● Saying something does not make it true, even if
you say it very loudly and very often. It just
makes it noisy.
● Let's swap out marketing and hype for logic and
math and separate the signal from the noise.

What is a Smart Grid?
First of all, what is a “Smart Grid”? It can mean
different things to different people in a utility
depending on their perspective:
● Customer
● Distribution
● Transmission
● Generation
● Regulatory

Smart Grid – Customer Perspective
● Smart metering means going from one meter reading
per month to a reading every fifteen minutes, roughly a
3,000-fold increase. For every one million meters, a utility
can expect to process 96 million reads per day.
● Time-Of-Day and Time-Of-Use billing considerations
are now in play
● Meters can communicate to customers but that raises
the question, who should be in charge of managing
load based on this information?

Smart Grid – Distribution Perspective
● Advanced sensing and switching devices at
the distribution feeder level can make
distribution system automation affordable
● Selective load control
● Managing distributed generation and even
islanding

Smart Grid – Transmission Perspective
● Increase stability and control by combining phasor
measurement units (PMUs) and GPS with a Supervisory
Control and Data Acquisition (SCADA) system at a central
control facility
● Flexible AC Transmission Systems (FACTS) are involved in
the derivation of Interconnection Reliability Operating Limits
(IROL) and require expensive monitoring devices, although
they are still less costly than building new lines
● Distributed and autonomous control could be used to
eventually create a “self-healing grid”

Smart Grid – Generation Perspective
● The Unit Commitment Problem (UCP) refers to scheduling power
generators (units) to meet electricity demand (load). Always complex
and critical, the variability of wind farms requires different algorithms.
● Additionally, the move from regulated to deregulated markets means
moving to day-ahead and real-time markets. Day-ahead requires
making a commitment based on a prior-day forecast, while real-time
requires adjusting the output per unit hourly, with surplus/deficit units
being traded on the Independent System Operator (ISO) market.
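To make the scheduling idea concrete, here is a deliberately simplified sketch. It is not a real unit commitment solver (it ignores start-up costs, ramp rates and minimum run times); it only commits the cheapest units first until the load is met. The unit names, capacities and costs are made-up assumptions.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    name: str
    capacity_mw: float
    cost_per_mwh: float   # hypothetical marginal cost

def dispatch(units, load_mw):
    """Commit the cheapest units first until load is met; return {name: MW}."""
    schedule = {}
    remaining = load_mw
    for unit in sorted(units, key=lambda u: u.cost_per_mwh):
        if remaining <= 0:
            break
        output = min(unit.capacity_mw, remaining)
        schedule[unit.name] = output
        remaining -= output
    if remaining > 0:
        raise ValueError(f"load exceeds total capacity by {remaining} MW")
    return schedule

# Illustrative fleet only; none of these numbers come from a real utility.
units = [Unit("coal", 500, 30.0), Unit("gas", 300, 45.0), Unit("wind", 200, 5.0)]
day_ahead = dispatch(units, 700)   # commitment based on the prior-day forecast
real_time = dispatch(units, 760)   # hourly adjustment once actual load is known
```

In this toy case the real-time run brings the gas unit online for the extra 60 MW that the day-ahead forecast missed.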

Smart Grid – Regulatory Perspective
● Smart Meters are considered (rightly or wrongly) to
be a significant privacy and security risk and there
will likely be a patchwork of federal and state
regulations of varying technical feasibility.
● As emerging sources of energy are developed,
there will be additional regulations.
● As new technologies are developed, there will be
additional regulations.
● Basically, there will be additional regulations.

Smart Grid – IT Perspective
● There are a lot of managers that will drive up the heat and intensity
without providing much clarity.
● There are a lot of vendors that have big ticket items that they say will fix
our problems.
● Let's step back and clearly define the problem so we can identify what
form a solution would take before we start writing checks for vaporware.
● In this next section, we'll come up with a clear problem definition and
come up with an algorithmic approach to the problem. We should at the
very least have a good idea of the Big-O of our proposed solution space.
● Once we have a framework, we can more intelligently choose an
implementation.

Smart Grid – Requirements
Essentially, there has been one fundamental technical change: More devices are reporting more data more frequently
● What are these “devices” again?
– smart meters
– sensors
– synchrophasors
● What do we mean by “reporting”?
– These are just devices, so each one individually is really only capable of generating a text-based log file. It could be fixed or variable
length, XML or JSON, but it will be text. Also, all of these devices now have an IP address, so we will receive it on the network
somehow. Basically, we will not be getting a hand-drawn picture on microfiche (don't laugh: eGIS does).
● What do we mean by “more”?
– We know that smart meters generate 3,000 times more reading intervals than traditional meters. Their payload is a lot bigger, too. We
also know that there are more sensors and control devices in use, but we don't have hard numbers on that. Since the
company will likely grow, let's call it 100,000 or 10^5.
● What do we mean by “data”?
– This type of data is called time series data, which Wikipedia tells us “is a sequence of data points, measured typically at successive
points in time spaced at uniform time intervals.”
● What do we mean by “frequently”?
– For smart meters, that's typically (but not exclusively) 15-minute intervals. Sensors and synchrophasors may report more or less often, but we
are talking about minutes. So we're not talking about days or hours anymore, but at least we aren't talking about seconds or
milliseconds. We'll call it “near-real time” or a 10^3 increase in speed (60 * 24 = 1440 minutes per day).

Essentially, there has been one fundamental requirement change: provide more frequent and robust analytics
● What do we mean by “provide”?
– There will need to be both ad-hoc and structured analytics and reporting. It is worth noting that data at scale is often not
amenable to the same types of reports that are used for more modest, enterprise-size data.
● What do we mean by “more frequent”?
– For most use cases, the difference between “advanced” and “standard” is speed, not detail. A generation system is
advanced if it can resolve unit commitment problems for the real-time, rather than daily, market. A transmission system is
considered advanced if it can resolve phase issues before they cause a problem. The engineering problems are well
defined by the laws of physics; we just need to be faster in order to be more reliable, effective and affordable.
● What do we mean by “robust”?
– If we give a generation analyst access to such a deep, broad and fast pool of data, there are multiple algorithms that can
be run against that data to possibly develop new strategies for managing unit commitment in a deregulated market with
wind-farm-based generation, strategies that could never be tried without that data.
● What do we mean by “analytics”?
– Analytics, or analysis of data, is a process of inspecting, cleaning, transforming, and modeling data with the goal of
discovering useful information, suggesting conclusions, and supporting decision making. The major categories of analytics
that are typically performed on time series data are on the following slide.

Time Series Data – Requirements
● Indexing
– Given a query time series Q, and some similarity/dissimilarity measure D(Q, C), find the most similar time series in database
DB
● Clustering
– Find natural groupings of the time series in database DB under some similarity/dissimilarity measure D(Q, C)
● Classification
– Given an unlabeled time series Q, assign it to one of two or more predefined classes
● Prediction
– Given a time series Q containing n data points, predict the value at time n + 1.
● Summarization
– Given a time series Q containing n data points where n is an extremely large number, create a (possibly graphic)
approximation of Q which retains its essential features but fits on a single page, screen, etc.
● Anomaly Detection
– Given a time series Q, assumed to be normal, and an unannotated time series R, find all sections of R which contain
anomalies or “surprising/interesting/unexpected” occurrences
● Segmentation
– Given a time series Q containing n data points, construct a model Q1 from K piecewise segments (K << n), such that Q1
closely approximates Q
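Two of these tasks can be sketched in a few lines. The example below assumes Euclidean distance as the measure D(Q, C) and uses segment means as a crude piecewise model; both are illustrative choices, not something the slides prescribe. The meter names and readings are invented.

```python
import math

def euclidean(q, c):
    """Dissimilarity D(Q, C) for two equal-length series (assumed measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

def most_similar(q, db):
    """Indexing: name of the series in db that is closest to the query q."""
    return min(db, key=lambda name: euclidean(q, db[name]))

def segment_means(q, k):
    """Segmentation/summarization: approximate q by k segment means.
    Assumes len(q) is divisible by k to keep the sketch short."""
    width = len(q) // k
    return [sum(q[i * width:(i + 1) * width]) / width for i in range(k)]

# Made-up readings: meter_a steps up mid-series, meter_b stays flat.
db = {
    "meter_a": [1.0, 1.2, 1.1, 1.3, 3.9, 4.1, 4.0, 4.2],
    "meter_b": [2.0, 2.1, 2.0, 2.2, 2.1, 2.0, 2.2, 2.1],
}
query = [1.1, 1.1, 1.2, 1.2, 4.0, 4.0, 4.1, 4.1]

closest = most_similar(query, db)     # the stepped series, meter_a
summary = segment_means(query, 2)     # two values standing in for eight
```

A linear scan like this is O(n) per query; the point of a real index is to avoid comparing against every series.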

Essentially, there have been some things that will not change.
● Safety
– When you have a lot of data that you never had before and are using it to ask questions you never asked before, you need
to make sure that the answer you get makes sense.
– Be conscientious with customers' Personally Identifiable Information (PII). Write secure code: utilities get hacked.
● Governance
– When there are fundamental changes, some of which are tied to substantial dollar amounts, there is often a push to cut
corners. It is never a good idea to be less careful when there is more at stake, but that is typical.
● Compliance
– This is a regulated industry. We are frequently audited. Architect, design and develop for transparency.
● Service Level Agreements
– The solutions used must be able to be implemented on Silver, Gold and Platinum level projects. Keep in mind that there are
enterprise-grade open source solutions and functionally worthless COTS products.
● Budgeting
– Assume no new staff and no money for training. It's just easier that way.

Smart Grid – Problem Statement
We now have our problem well defined:
Provide analytic capability 10^3 times faster based on
10^5 times more time series data while adhering to the
current corporate standards of safety, governance
and regulatory compliance.

Smart Grid – Problem Statement
Let's take an overview of where we'll need to make improvements:
● Provide near real-time complex analytical capability to business
– 10^3 improvement
● Based on 10^5 more time series data than currently managed
– 10^5 improvement
● While adhering to the current corporate standards of safety, governance and
regulatory compliance.
– Constant
So we need to store a lot more data in such a way that it can be accessed much
faster. Are there data structures and algorithms that can provide for such an
increase in time series data?

Time Series Data
● Storing time series data means taking a large
number of small files and persisting them using
the time (at a minimum) as the natural order.
● The emphasis is on insert, since there is no
mandatory prerequisite for updating or deleting
time series data
● This data will come from multiple sources so
time (and possibly location) is really the only
metric that they must have in common

Time Series Data
● What would a generic data structure look like for a time series datum, assuming that we need to strip out identifiable
data and the majority of the data that we can preserve cannot be guaranteed to be present in all device types*?
● To make life a little easier, assume the object needs to be queried in the following manner:
– Range queries which specify value conditions, such as: select * from dataset where sales.December ∈ [1 million, 1.2 million]
– Pattern matching queries which rely on the definition of pattern similarity, such as: Given time series q, select r from dataset where
similarity(r, q) > threshold (or distance(r, q) < δ)
– Identity queries which rely on the availability of an equality operator
● In order to rapidly query time series data, there needs to be a clear and relevant sequential-based index
● A reasonable key could be composed of metric name : timestamp : random 64 bit integer : geo tag. If you needed to
retrieve a block of data that would provide all of the smart meter readings in a High Consequence Area in Ohio in the
third quarter of 2014, that would certainly be readily available.
● Note that this key also makes a strong input to a hash function. Since a hash table provides the fastest possible data
retrieval (constant time), it is very important to ensure that the hash is well generated. A bad hash degrades a hash
table to a linked list and we will get data in O(n) rather than O(1).
* ANSI C12.19 is a specification, not an implementation
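A minimal sketch of the composite key described above. The helper name `make_key`, the timestamp format, the `:` separator and the sample metric and geo tag are all assumptions for illustration; the slides only specify which four parts the key contains.

```python
import secrets
from datetime import datetime, timezone

def make_key(metric, ts, geo_tag):
    """Build 'metric:timestamp:random64:geo' (hypothetical layout).
    A fixed-width UTC timestamp keeps keys for one metric in time order."""
    stamp = ts.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    nonce = secrets.randbits(64)          # random 64 bit integer
    return f"{metric}:{stamp}:{nonce:016x}:{geo_tag}"

# Sample values are invented; neither the metric nor the geo tag is a standard.
key = make_key(
    "meter.kwh",
    datetime(2014, 7, 1, 0, 15, tzinfo=timezone.utc),
    "OH-HCA-017",
)
parts = key.split(":")

# A quarter-long range scan becomes a comparison on the key prefix.
in_q3_2014 = "meter.kwh:20140701" <= key < "meter.kwh:20141001"
```

The random component only breaks ties between readings that share a metric and timestamp; it carries no meaning of its own.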

Time Series Data
Time series data lends itself to heavy disk I/O. This is not good news.
● Seek Rate
– Measure of the time it takes to position the head before data can be read from or written to disk. This is roughly 20ms for HDDs and 0.1ms for SSDs.
For HDDs, this 20ms estimate is variable since the performance of the arm is non-linear and the head
may need to move to multiple locations to read a single file.
● Transfer Rate
– Measure of how quickly data moves between the controller and the host system (external rate) as
well as between the disk surface and the controller (internal rate). This can be 100MB/sec for an HDD.
As far as disk operations are concerned, it is better to transfer than to seek and even then it's
helpful to minimize the frequency of the transfers in favor of larger payloads. Since the data is
both sequential and immutable, we can have a reasonable expectation that this can be
optimized.
The seek times for RAM are in nanoseconds (1E-09) rather than milliseconds (1E-03), so it
makes sense to short-circuit any deep conversations about partial stroking and hybrid drives.
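A quick worked example of why transferring beats seeking, using the 20ms seek and 100MB/sec transfer figures from this slide. The workload (one million reads of 1 KB each) is an assumption chosen to keep the arithmetic round.

```python
HDD_SEEK_S = 20e-3          # 20 ms per seek (figure from the slide)
HDD_TRANSFER_BPS = 100e6    # 100 MB/sec (figure from the slide)

def random_read_seconds(n_records, record_bytes):
    """Every record needs its own seek before it can be transferred."""
    return n_records * (HDD_SEEK_S + record_bytes / HDD_TRANSFER_BPS)

def sequential_read_seconds(n_records, record_bytes):
    """One seek, then a single contiguous transfer."""
    return HDD_SEEK_S + n_records * record_bytes / HDD_TRANSFER_BPS

n, size = 1_000_000, 1_000  # assumed workload: one million 1 KB reads
scattered = random_read_seconds(n, size)       # about 20,010 s (over 5 hours)
contiguous = sequential_read_seconds(n, size)  # about 10 s
```

Under these assumptions the same gigabyte takes hours when scattered and seconds when laid out sequentially.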

Time Series Data – Data Structure
We need a storage system that
● minimizes seek time
● is optimized for index searches
● is optimized for sequential searches
● is optimized for inserts

● Different database engines use different
database structures including:
– B-trees (or B+ trees)
– Log Structured Merge Trees
– Hash table
– Graphs

● B+ Trees are balanced trees like B-Trees.
● They differ from B-Trees in that they maintain a doubly
linked list connecting each leaf node to its sibling, which speeds up
sequential scans as well as growth and contraction
● The primary value of a B+ tree is in storing data for efficient
retrieval in a block-oriented storage context. Many file systems
use this structure and it is used by every major relational
database vendor for their key indexes.
● B+ trees are great for random access
● B-Trees are typically shallow but wide, so there are a
minimal number of seeks.
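To see what “shallow but wide” buys, the sketch below counts how many levels a tree needs to address a given number of keys. The fanout of 100 is an assumed round number; real fanout depends on key and block size.

```python
def tree_levels(n_keys, fanout):
    """Smallest number of levels such that fanout ** levels >= n_keys."""
    levels, capacity = 1, fanout
    while capacity < n_keys:
        capacity *= fanout
        levels += 1
    return levels

one_billion = 1_000_000_000
wide = tree_levels(one_billion, fanout=100)   # 5 levels
binary = tree_levels(one_billion, fanout=2)   # 30 levels
```

If each level costs one disk seek, a wide tree reaches any of a billion keys in about 5 seeks where a binary tree would need about 30.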

● When inserting a record into a B+ Tree, you need to search the
tree to find the location to insert the record. Depending on whether
or not the target node is full, you may or may not need to split it to
make room for the new record. The doubly linked lists help here.
● Since B-Trees are designed to be wide and shallow, there should
be a minimal number of drive seeks.
– “From a practical point of view, B-trees, therefore, guarantee an access
time of less than 10 ms even for extremely large datasets.”
(Dr. Rudolf Bayer, inventor of the B-tree)
● B+ Tree inserts are O(log n), which is arguably the
mathematical lower bound for balanced trees.
● So can we do better?

● There are two key potential areas of
improvement for B+ Trees that are applicable for
systems that are going to do large quantities of
sequential writes:
– Moving seek time from disk to memory
– Moving from block data to log structured data
● The data structure is called a Log Structured
Merge Tree (LSM) and the storage model is
called Log Structured Storage (LSS)

LSM Tree
● An LSM-Tree is a hybrid tree model that uses two trees: C0
and C1. C0 is smaller and entirely resident in memory,
whereas C1 is resident on disk. New records are written to
C0 and migrated from C0 to C1 based on a size threshold.
● Insertions now run primarily at RAM rather than HDD speeds,
or 1E-09 rather than 1E-03 seconds. Of course, they are
eventually written to disk, but that is where the LSS comes in.
● Note that many production systems concurrently
write to a commit log on disk and to C0, with the commit log
getting deleted after flushing.
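The write path described above can be sketched as follows. This is a toy model, not how any particular engine implements it: Python lists stand in for the disk, C1 is kept as a list of sorted segments, and the merging (compaction) of those segments is left out. The class name `TinyLSM` and the threshold are assumptions.

```python
import bisect

class TinyLSM:
    """Toy LSM store: dict as C0, sorted immutable segments as C1."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.commit_log = []   # stand-in for the on-disk commit log
        self.c0 = {}           # in-memory component
        self.c1 = []           # flushed segments, oldest first ("disk")

    def put(self, key, value):
        self.commit_log.append((key, value))   # durability first
        self.c0[key] = value                   # then the fast in-memory write
        if len(self.c0) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # One sequential write of a sorted segment; no in-place updates.
        self.c1.append(sorted(self.c0.items()))
        self.c0 = {}
        self.commit_log = []   # safe to drop once the segment is persisted

    def get(self, key):
        if key in self.c0:
            return self.c0[key]
        for segment in reversed(self.c1):      # newest segment wins
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

store = TinyLSM(memtable_limit=2)
store.put("m1:0000", 1.0)
store.put("m1:0015", 1.5)   # reaches the threshold and triggers a flush
store.put("m1:0030", 2.0)   # lands in a fresh C0
```

Reads check memory first and fall back to the flushed segments, which is why LSM designs favor insert-heavy workloads.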

LSS
● In a traditional storage system, there needs to be a considerable amount of
overhead for updating and deleting existing members. In a log structured storage
system, this overhead does not exist because a log structured storage system
provides for an append-only sequence of data entries. Unlike a B+ Tree-based
system, you don't find a location for new data, you merely append it to the end.
● Because new records are always added to the end, there is never any need for
searching a tree for insertion, like in a B-tree storage structure. This allows for
extremely predictable horizontal scaling.
● Providing concurrency and transactional semantics using Multiversion Concurrency
Control (MVCC) is easier in LSS than in a B-tree since existing data is not modified.
A view of the system at state Q at time A is just as valid as a view of Q at time B.
● Note that a local file system typically uses a block size of a few KB (4KB is a common default). The
HDFS file system was originally designed with 64MB block sizes but is often
configured for 128MB block sizes.
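The point about views at different times can be shown with a very small append-only log. This is only an illustration of the idea, not an MVCC implementation; using the log length as a version number is an assumption made for brevity, and the feeder name and values are invented.

```python
class AppendOnlyLog:
    """Entries are only ever appended, so any earlier prefix is a valid view."""

    def __init__(self):
        self.entries = []   # (key, value) in arrival order

    def append(self, key, value):
        self.entries.append((key, value))
        return len(self.entries)   # version = log length after this write

    def read(self, key, version=None):
        """Latest value for key as of the given version (default: now)."""
        upto = len(self.entries) if version is None else version
        for k, v in reversed(self.entries[:upto]):
            if k == key:
                return v
        return None

log = AppendOnlyLog()
time_a = log.append("feeder-7", 118.2)
time_b = log.append("feeder-7", 121.9)

view_at_a = log.read("feeder-7", version=time_a)   # still sees the first value
view_at_b = log.read("feeder-7", version=time_b)   # sees the newer value
```

A reader holding the earlier version is never disturbed by the later write, because nothing it can see was modified.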

Hash Table
– With a good hash that minimizes, or optimally avoids, collisions, a hash table gives constant-time
[O(1)] lookups in the best case, and no data structure can do better than that. In the worst case,
such as a hash that results in a lot of collisions, lookups degrade to linear time [O(n)].
– There are drawbacks to hash tables, particularly with dynamic resizing, so most
database systems tend to go with a data structure that has a logarithmic best and
worst case [O(log n)] for fear of linear time performance.
– However, if you can take sequential data from a database and load it in memory and
access it using a hash table, there are operations that can be performed that are not
just faster, but would be impossible otherwise.
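The best-case and worst-case behavior is easy to demonstrate with a small chained hash table that reports how many entries each lookup had to examine. The class is a teaching sketch under simplifying assumptions: the bucket count is fixed (no resizing) and duplicate keys are not handled.

```python
class ChainedHashTable:
    """Minimal separate-chaining table with a pluggable hash function."""

    def __init__(self, hash_fn, n_buckets=64):
        self.hash_fn = hash_fn
        self.buckets = [[] for _ in range(n_buckets)]

    def put(self, key, value):
        self.buckets[self.hash_fn(key) % len(self.buckets)].append((key, value))

    def get(self, key):
        """Return (value, number of entries examined)."""
        bucket = self.buckets[self.hash_fn(key) % len(self.buckets)]
        for steps, (k, v) in enumerate(bucket, start=1):
            if k == key:
                return v, steps
        return None, len(bucket)

good = ChainedHashTable(hash_fn=hash)            # spreads keys across buckets
bad = ChainedHashTable(hash_fn=lambda key: 0)    # every key collides
for i in range(640):
    good.put(i, i * 2)
    bad.put(i, i * 2)

value_good, steps_good = good.get(639)   # examines 10 entries in one short chain
value_bad, steps_bad = bad.get(639)      # walks all 640 entries, like a linked list
```

With the degenerate hash, the table is just one long list, which is the O(n) case the slide warns about.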

Time Series Data – Compliance
● Having defined the optimal data structure, what is the required consistency model for the data?
● The CAP Theorem
– Consistency
All clients see the same view of the data, even in the presence of updates
– Availability
All clients can find some replica of the data, even in the presence of failure
– Partition Tolerance
The system property holds even if the system is partitioned.
Now, define your problem set(s) and pick two. The easiest way to identify where a use case falls on the CAP Theorem
is to identify the consistency model you need.
● For time series data from devices, availability and partition tolerance are key drivers. The data should never be lost
and the system should not be unavailable, and a combination of near real-time access and a lack of hard relationships
among rows makes this the logical choice. The data should be partitioned across multiple nodes based on a reasonable hash
in order to avoid the hot-spotting problem that can arise with time series data indexes. This is an AP model.
● By contrast, consider a banking system. If a customer makes a transfer from checking to savings, anyone who
looks at that data must see the same result. This would not be guaranteed if the checking and savings accounts were separately
partitioned, so this is a CA model.
● By the way, relational databases are all CA and CA is the only way to be ACID. NoSQL databases are either CP or AP.
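The hot-spotting problem mentioned above is easy to reproduce. The sketch below compares two hypothetical placement rules for a four-node cluster: one places writes by the current hour, the other by a hash of the device id. The node count, device names and use of SHA-256 are assumptions for illustration only.

```python
import hashlib
from collections import Counter

NODES = 4   # assumed cluster size

def node_by_time(device_id, hour):
    """Time-ordered placement: everything written this hour lands together."""
    return hour % NODES

def node_by_hash(device_id, hour):
    """Hash placement: the device id decides the node, not the clock."""
    digest = hashlib.sha256(device_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NODES

devices = [f"meter-{i:04d}" for i in range(1000)]   # invented device ids
current_hour = 14

time_load = Counter(node_by_time(d, current_hour) for d in devices)
hash_load = Counter(node_by_hash(d, current_hour) for d in devices)
```

With time-based placement one node absorbs all 1,000 writes for the hour while the others idle; with hashing the writes spread across all four nodes.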

Time Series Data – Compliance

To summarize,
● The best data structure for inserting into our persistent storage engine for time
series data would run in O(log n) or logarithmic time.
● B+ Trees and Log Structured Merge Trees are both appropriate, but the LSM
Tree will deliver better performance for inserting time series data.
● Proper configuration of the LSM Tree engine could move a substantial amount
of the operations from disk to memory (1E-09 rather than 1E-03 seconds)
● The best data structure for reading from our persistent storage engine would
run in O(1) or constant time.
● The consistency model will need to be AP.
● Since we need to process 1E05 times more data, moving as much processing as possible
from 1E-03 to 1E-09 seconds will absolutely get us there.

Time Series Data – Processing
To process a request from an analyst, we will need to:
– retrieve the data from the data store
– store the data in RAM
– perform calculations on the data
– return the result
● All in near-real time, which we have identified as between
1 and 15 minutes
● We will need to get bigger and faster, but what does
bigger and faster really mean?

● Scale up
– Hardware perspective:
● Adding more CPU to increase computational performance
● Adding RAM to increase query and data caching
● Adding more storage such as SSDs and partitioning various I/O processes to different physical disks
– Database perspective:
● Replication techniques to enable various applications to connect to replica databases and eliminate
computational and I/O constraints on the master
● Clustering configurations to handle failover and availability concerns
● Vertical and horizontal database partitioning to optimize query performance
● Scale Out
– Sharding/Federating
– Horizontal partitioning (separating operational and analytical)
– Hadoop cluster

● So how much data are we talking about?
– Ballpark, 10 million devices (smart meters, etc.) will generate 1 billion reads per day.
If each read is 1KB, we've already hit the TB mark. Let's say that we are talking
somewhere in the TB to PB range within the foreseeable future.
● How do you choose to scale up or scale out? Base it on the size of the data
that needs to be loaded into RAM.
– With transfer rates at 100MB/sec, it'll take 17 minutes to move 100GB of data from
disk to RAM, which breaks the 15 minute mark.
– It would appear unlikely, however, that a single analyst would process 100GB of time
series data at a time, and it's even unlikely that a group of analysts would do so.
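The ballpark figures above can be checked with a few lines of arithmetic. The inputs are the ones quoted on this slide (10 million devices, 15-minute reads, 1KB per read, 100MB/sec); decimal units are assumed throughout.

```python
DEVICES = 10_000_000
READS_PER_DEVICE_PER_DAY = 24 * 4   # one read every 15 minutes
READ_BYTES = 1_000                  # 1 KB per read
TRANSFER_BPS = 100_000_000          # 100 MB/sec from disk to RAM

reads_per_day = DEVICES * READS_PER_DEVICE_PER_DAY   # 960 million, about 1 billion
bytes_per_day = reads_per_day * READ_BYTES           # 0.96 TB per day

def load_minutes(n_bytes):
    """Minutes needed to move n_bytes from disk into RAM."""
    return n_bytes / TRANSFER_BPS / 60

def max_bytes_within(minutes):
    """Largest job that can be loaded inside the given time budget."""
    return minutes * 60 * TRANSFER_BPS

hundred_gb = load_minutes(100_000_000_000)   # about 16.7 minutes
budget = max_bytes_within(15)                # 90 GB fits in 15 minutes
```

So on a single 100MB/sec disk, a job has to stay under roughly 90GB to load within the 15 minute window.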

Time Series Data – Summary
● The goal was to provide 10^5 times more time series data to
analytic jobs running 10^3 times faster.
● Storing the time series data would optimally use an LSM
Tree, preferably on a file system that provides for
large, read-only block sizes.
● Reading the time series data would optimally use an in-
memory hash table.
● Estimated job sizes would likely require GB of memory
due to geographic and temporal realities of time series
data from physical devices.

So, wait, do we have a Big Data problem?

Big Data – 3Vs
Big Data problems are traditionally defined using one or more of the following metrics:
Volume : Volume = rows / objects / bytes
Volume refers to the size of the data to be processed. Is the size of your enterprise data limited by cost or by potential business
opportunity? What could you do with fast, cheap, predictable data growth?
“Big Data is any data that is expensive to manage and hard to extract value from.”
(Michael Franklin, Director of Algorithms, Machines and People Lab, University of California, Berkeley)
Velocity : Velocity = number of rows / bytes per unit time
Velocity refers to the latency of data processing relative to the growing demand for interactivity. How much more responsive could your
business be if jobs that used to run overnight can now be run on demand? On a data set an order of magnitude larger? For less money?
“Real-time big data isn't just a process for storing petabytes or exabytes of data in a data warehouse. It's about combining and analyzing data so you can
take the right action, at the right time, in the right place.”
(Michael Minelli, Big Data, Big Analytics)
Variety : Variety = number of columns / dimensions / sources
Variety refers to the diversity of sources, formats, quality and structures. Business value should not depend on a strict data model.
Unstructured data and semi-structured data, used properly, can yield valuable and actionable insights.
“...no greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures, and inconsistent
data semantics”
(Doug Laney, 3-D Data Management: Controlling Data Volume, Velocity and Variety, Gartner 2001)

Smart Grid – 3Vs
Volume : Volume = rows / objects / bytes
Although the volume of data that utility companies will face is not in the same league as that of some
data-intensive companies, such as Facebook, the move from monthly meter reads to 15-minute scalar
reads means a 3,000-fold increase in data, and your existing system will not easily manage an
increase of three orders of magnitude.
Variety : Variety = number of columns / dimensions / sources
Data from industrial control systems will be joined by unstructured and semi-structured data from
weather forecasting systems, security cameras, maps, drawings, pictures, call center logs, social
media and other web resources. An intelligent use of data munging can provide an integrated data
pool from which to inform planning processes and decision making.
Velocity : Velocity = number of rows / bytes per unit time
New grid instrumentation and sensors generate relatively large amounts of streaming data and
there are substantial benefits to be had if this data can be processed in real time for equipment
reliability monitoring, outage prevention and security monitoring.

Duke Energy – 3Vs
For Duke, in order of priority:
– Velocity
– Variety
– Volume

IT Perspective – Action Alternatives
● So what's next?
● There are essentially three basic strategies that
you can use when confronted with a well
defined problem:
– Do we need to take any action at all?
– Can we take evolutionary steps?
– Do we need to take revolutionary measures?

IT Perspective – No Action
● Not taking any action at all would mean storing
the Time Series Data in a relational database
(Oracle or SQL Server or DB). They use a B+
Tree index and we really want an LSM Tree
index.
● But IBM has a product called IBM Informix
TimeSeries for Meter Data Management.
Wouldn't that work?
● ;)

IT Perspective – Revolutionary Action
● A revolutionary step would be to start throwing
away existing platforms and processes and
jumping into the new with both feet.
● That would only be a good idea if you were a
start-up or if you were experimenting with a
new line of business.
● The meter is the cash register of the utility.
Revolutionary is inappropriate.

IT Perspective – Evolutionary Action
● Evolutionary steps involve taking the smallest
actions that would get the biggest result.
● Change one small thing, see what happens,
repeat.
● It also doesn't hurt if we aren't the first people to
do this since we can learn from other people.
● The natural first move would be to adopt a
different data storage model.

Big Data – Why?
● At scale, it has become apparent that one size
does not fit all when it comes to data
management.
● Different approaches were tried to make RDBMS
work with Big Data (e.g., sharding, denormalizing,
etc.).
● The hacks are typically operational nightmares
and eventually new databases, collectively called
NoSQL, were developed.

Big Data – Why Not?
There are four basic enterprise arguments
against NoSQL that should be addressed:
– No ACID equals No Go
– SQL is mandatory
– NoSQL means NoStandards
– NoSQL is for Startups

No ACID equals No Go
Critique: Mission critical data must be Atomic,
Consistent, Isolated and Durable
Response: Of course it should. Customer billing
information needs Consistency and Availability (CA)
and should therefore be stored in an RDBMS.
Customer behavior patterns from your web site and
call center logs, however, need not be treated in the
same manner. No one in the NoSQL community is
arguing for a replacement of RDBMS, just additional
options.

SQL is mandatory
Critique: Low-level query languages (e.g., CODASYL) have
never found support and SQL is a common language
among business users and developers.
Response: This is a great point and that's why so much
effort has been put into Apache Hive, Cloudera Impala
and DataStax CQL to provide SQL-like front ends to Hadoop and Cassandra.
Unfortunately, NoSQL is a name that has stuck and drawn a
line in the sand that is actually not there.

NoSQL means NoStandards
Critique: Large enterprises may have thousands of databases. These need accepted
standards.
Response: This critique typically falls apart on its own.
Some of us remember something similar being leveled against the upstart relational
model by the mainframe people. Ask yourself if your enterprise has a consistent set of
standards that apply to all databases across the enterprise. Just for fun, try to find out
how many different ways customer addresses are stored.
Also, how do you define 'database'? Most likely not as 'some entity that contains
valuable business data'. That would then include emails, spreadsheets, documents, call
logs, images, log files (internet and device), and even external sources such as social
media. Also, databases do not spring up fully formed and compliant. They are the result
of architectural design and implementation discipline. If you apply the same discipline to
developing your NoSQL schemas, then you have standards.

NoSQL is for Startups
Critique: Startups can use NoSQL because they are too new to have data structures.
Established companies have established data structures.
Response: This is true to a point. When you are a startup, most, but not all, of your use
cases are edge cases. A startup's billing system needs are probably similar to those of an
established company, but we'll leave that aside. The fact of the matter is that the
business landscape has changed and not all of these changes can be managed by the
traditional corporate OLTP system. The explosion not only of the internet but to an even
greater degree mobile devices has presented opportunities that few industries can
ignore.
The real numbers behind the adoption of NoSQL technologies tell the real story. More
than half of the Fortune 500 companies today have a significant investment in Big Data
and NoSQL related technologies.

Scalability: CAP Theorem
● Consistency
Commits are available across entire distributed system
● Availability
System remains accessible and operational at all times
● Partition Tolerance
Only a total system failure can cause the system to respond incorrectly
Now, define your problem set(s) and pick two. The easiest way to identify where a use case falls on the
CAP Theorem is to identify the consistency model you need.
● CA
Traditional relational databases
● AP
Dynamo-like systems, Cassandra, CouchDB, Voldemort, Riak
● CP
BigTable-like systems, MongoDB, HBase, Memcached, Redis

Big Data – IT Perspective
● We have now identified that we want to make the
smallest change that has the biggest impact and that
would be moving our underlying data structure for
time series data from a B+ Tree to an LSM Tree.
● We have identified AP as our consistency model for
scalable data.
● We have seen that Amazon's Dynamo data model
most closely fits this model and we have identified
three applications based on this model.

IT Perspective – Apply Invariants
● The final step is to apply our invariants to our solution :
adhering to the current corporate standards of safety,
governance and regulatory compliance.
● All three are open-source projects and all three have an
associated corporate entity: Cassandra has DataStax,
Riak has Basho and CouchDB has Couchbase.
● Only DataStax and Basho offer an enterprise-grade
distribution that would be appropriate for utilities.

Comparison of Riak and Cassandra from
Enterprise Development Framework
● Cut to the chase: Riak is written in Erlang and Cassandra is written in
Java. You can even use Spring with Cassandra.
● One other interesting point is that Cassandra offers tunable
consistency. This means that if you have a use case that needs a
different consistency model than the AP model I described, then use
Cassandra. It will allow you to blur the distinctions between strictly AP
and strictly CP.
● From a development perspective, Cassandra is likely the most
straightforward LSM datastore to integrate into a current framework.
● Erlang? Really?
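Tunable consistency is usually reasoned about with one rule of thumb: reads are guaranteed to see the latest write when the number of replicas read plus the number of replicas written exceeds the replication factor. The helpers below are hypothetical names used to show that arithmetic; they are not part of any Cassandra API.

```python
def quorum(replication_factor):
    """Majority of replicas, e.g. 2 of 3."""
    return replication_factor // 2 + 1

def reads_see_latest_write(replication_factor, read_replicas, write_replicas):
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor

rf = 3   # assumed replication factor
fast_and_available = reads_see_latest_write(rf, 1, 1)                 # False
quorum_both_ways = reads_see_latest_write(rf, quorum(rf), quorum(rf)) # True
```

Reading and writing a single replica keeps the system fast and available, which suits device data; quorum reads and writes trade some of that for stronger guarantees on a per-query basis.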

Comparison of Riak and Cassandra from
Enterprise Architect Framework
● Between DataStax Enterprise and DataStax OpsCenter, DataStax
can be the single vendor for data (Cassandra), analytics (Hadoop)
and search (Solr).
● DataStax offers enterprise grade auditing and security capabilities.
● DataStax has no single point of failure. Not a failover scenario; no
SPOF. This is a huge difference.
● DataStax offers actual horizontal scalability. Replication and
sharding of a B+ Tree datastore are not truly horizontal.
