Opower, a Cloudera customer, discusses how they implemented a scalable energy analysis platform that generates personalized insights for millions of people. To date, Opower’s insights have collectively saved over 5 terawatt-hours of energy and $500 million in energy bills.
2. Agenda
Why and How to Operationalize Analytics
Opower – Personalized Energy Usage Insights
Opower - Before, After, and Lessons Learned
Live Q&A
Speakers
TJ Laher
Product Marketing at Cloudera
Scott Kuehn
Data Architect at Opower
Get Social
#ClouderaWebinars
3. Why Automate Insights?
• Unlock Competitive Advantages
• Decision Point Analytics
• Increase Data Returns
4. The Process of Operationalizing Analytics
[Diagram of the two flows]
• Analyst Discovery Flow: Data Generation → Batch Processing → Data Discovery → Analysis Technique → Batch Processing → Report, Model, or Rules
• Operational Analytics Flow: Data Generation → Stream or Batch Processing → Respond to Data → Feed Data Application → Optimize Report, Model, or Rules
5. Preparing for Operational Analytics
[Diagram of the pipeline]
• Data Sources: Structured, Unstructured
• Data Analysis: Data Processing & Storage (Batch, Stream); Human Data Discovery; Machine Response (Single Analysis); Optimize, Extend, Innovate
• Data Serving: Data Store, Applications
7. Opower Overview
A Software as a Service Customer Engagement Platform
The Company
• Serving 95+ utilities in 9 countries
• Over 5TWh saved to date
• 40% of US household data under management, totaling 300 billion reads
Our DNA
• Behavioral science software
• Data analytics
• Consumer marketing
• User-centric design
11. Insight Creation Environments
[Diagram of the two environments]
• Product Calculation and Delivery: External Feeds → Insight Calculation (MR over HBase) → Insight Delivery
• Offline Analysis and Experimentation (fed via HBase Export): Hive, BI, raw MR, and batch tools over HDFS → Reporting and non-product insights
12. What does this mean to end users?
[Charts: Pre-Hadoop vs. Modern Hadoop]
• Batch Analytic Calculations: days pre-Hadoop vs. hours (charted on a 12/24/48-hour scale) with modern Hadoop
• Individual Insight Query Latency: ~3 seconds pre-Hadoop vs. ~10 ms with modern Hadoop
• Analytic Development Time: months pre-Hadoop vs. weeks with modern Hadoop
13. Key Lessons Learned: External Support
Cloudera support:
1. Issue resolution and escalation
2. Backport critical patches
3. Tuning and configuration guidance
Apache HBase community:
1. Community support channels
2. New features, bug fixes
3. Roadmap planning
14. Key Lessons Learned: Cluster operations
1. Cloudera Manager is useful: alerts, log collection, metrics
2. Upgrade often (safely)
3. Off-cluster data backup/replication
Customized charting via CM UI
Opower Intro: Who is Opower and what does Opower do?
Produce energy insights to help utilities and customers manage energy consumption.
100+ million meter reads are received daily. Millions of individual insight calculations are routinely created, from simple trending analytics to more advanced forecasting/prediction.
Energy saved: 5+ TWh, $500M in energy bill savings, >6 billion lbs of CO2
Product lines:
Consumer engagement
Energy efficiency
Demand Response
Hadoop-based insights are a critical portion of each of these product lines.
Transition: Some example Hadoop-based insights:
Two examples of Opower’s personalized insights that use Hadoop components: neighbor comparisons and unusual usage alerts
Billions of energy usage reads are stored in HBase, along with the insights derived from them. Insights are served directly from HBase.
Unusual usage alerts were the first use case for HBase/Hadoop. We sold a deal that required us to generate “unusual usage alerts” at a scale we had not yet reached.
UUA are email or phone messages we send to let customers know if they are trending toward higher-than-usual energy usage
We also project the bill for them and can let them know if they are going to pay more than expected
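As a sketch of the alert logic described above (purely illustrative: the function names, the linear extrapolation, and the 30% threshold are assumptions, not Opower’s actual rules):

```python
# Illustrative sketch (not Opower's actual code): flag a household as
# "unusual usage" when the projected bill-period total exceeds the
# historical baseline by some threshold.

def project_period_total(usage_so_far, days_elapsed, days_in_period):
    """Linearly extrapolate usage for the rest of the bill period."""
    daily_rate = sum(usage_so_far) / days_elapsed
    return daily_rate * days_in_period

def is_unusual(projected_kwh, baseline_kwh, threshold=1.3):
    """Alert when the projection exceeds baseline by, say, 30%."""
    return projected_kwh > baseline_kwh * threshold

# 10 days into a 30-day period, 120 kWh used so far; baseline is 250 kWh.
projected = project_period_total([12.0] * 10, days_elapsed=10, days_in_period=30)
print(projected)                                  # 360.0 kWh for the full period
print(is_unusual(projected, baseline_kwh=250.0))  # True -> send an alert
```

The same per-household calculation, run as a map step over every row, is what the batch pipeline produces at scale.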
Transition: The initial architecture we built to calculate and deliver this insight
Hadoop has been used in production at Opower since 2012.
Overview of the end-to-end architecture: data is copied from single-tenant MySQL databases into HBase. MySQL is single-tenant (one database per Opower client), and we have more than 100 MySQL databases in production. Batch clients read from HBase. Other workloads also run on this cluster, in an attempt to avoid supporting separate clusters per workload. Sqoop runs as a MapReduce job that reads data from MySQL and writes it to another store, such as Hive/HDFS or, in our case, HBase.
Challenges:
Sqoop ingest introduced a lot of memory pressure on the region servers and traffic on the MySQL read slaves. Care was needed to avoid excessive MySQL load from Sqoop queries, as the databases serve other critical apps
Queries required long multi-row scans and aggregations. Lots of tuning was necessary, such as increasing region file sizes, memstore sizes, and heap sizes, disabling major compactions, and enabling HDFS short-circuit reads
Composite row keys with timestamps in them meant we were thinking about HBase more like a relational table than the big sorted map it is
We had supporting data in single-tenant tables because we were Sqooping it over from the MySQL databases
Because of how we designed the schema, we needed multiple tables to store the data
Single-tenant tables added operational overhead and made it difficult to track bottlenecks in the process
Initial support for ad-hoc MR jobs via Hive was quickly removed due to unmanageable load
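The composite-row-key challenge above comes down to how HBase orders data: rows are sorted lexicographically by byte key, so the key layout determines scan locality. A minimal illustration (the key format here is hypothetical):

```python
# Illustrative sketch: HBase stores rows sorted lexicographically by byte
# key. A composite key of (meter_id, zero-padded timestamp) keeps one
# meter's readings contiguous, so a per-meter history becomes a short
# range scan instead of scattered point reads.

def row_key(meter_id, epoch_seconds):
    # Zero-pad the timestamp so lexicographic order matches numeric order.
    return f"{meter_id}:{epoch_seconds:010d}".encode()

keys = [
    row_key("meter-042", 1_400_000_100),
    row_key("meter-007", 1_400_000_000),
    row_key("meter-042", 1_400_000_000),
    row_key("meter-007", 1_400_000_200),
]
for k in sorted(keys):  # the order HBase would store them in
    print(k.decode())
# meter-007:1400000000
# meter-007:1400000200
# meter-042:1400000000
# meter-042:1400000100
```

Treating such keys like relational index columns works, but every multi-row aggregation still pays the cost of traversing many rows, which is the pain point the entity-centric schema later removed.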
This architecture has been successful, but difficult to scale. The HBase schema was difficult to extend to support new insights, and there was no story for offline analytics and experimentation.
Transition: V2, the modern Opower Hadoop architecture, addresses these issues
Overview/walkthrough of the major components: usage data is collected from the utility and directly ingested into HBase via bulkloading MR jobs. [Explain bulkloading] Data is stored in an entity-centric table, where each entity is a single HBase row containing the energy usage history for a household and any analytics derived from that usage, such as bill forecasts and neighbor comparisons. MapReduce jobs periodically refresh these analytics, but some are also refreshed on demand in a streaming fashion as insights are queried. Data is replicated to the data warehouse cluster via a combination of HBase replication (for direct puts) and an HFile distcp step during the initial bulkload ingest (not pictured).
Full, multi-tenant datasets are now available to be analyzed in the data warehouse, which has enabled new offline analytics such as product eligibility calculations and a general test bed for experimenting with new insights. There is no longer a need to painstakingly collect data from multiple sources or worry about crashing a MySQL slave when running a full table scan.
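A rough sketch of the bulkloading idea mentioned above (this is not the real HFile format, just the sorting invariant that lets a region server adopt files without going through its write path):

```python
# Illustrative sketch of why bulkloading is cheap for the region server:
# the MR job sorts all cells by key up front and writes HFile-like files,
# so the server simply adopts the finished files -- no per-Put RPCs, no
# memstore flushes, no write-ahead-log traffic.

cells = [
    (b"meter-042:1400000000", b"12.8"),
    (b"meter-007:1400000000", b"11.2"),
    (b"meter-007:1400000200", b"13.1"),
]

def build_hfile_like(cells):
    """Sort cells by row key -- the invariant an HFile must satisfy."""
    return sorted(cells, key=lambda kv: kv[0])

hfile = build_hfile_like(cells)
# Every adjacent pair is in order, as a region server would require.
assert all(hfile[i][0] <= hfile[i + 1][0] for i in range(len(hfile) - 1))
print([k.decode() for k, _ in hfile])
# ['meter-007:1400000000', 'meter-007:1400000200', 'meter-042:1400000000']
```

In the real pipeline this sort is the shuffle phase of the MR job, and the same output files can be distcp’d to the warehouse cluster, which is why both clusters stay fresh.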
Improvements:
Write-path performance via bulkloading: less GC pressure in the region server, no memstore flushes, and fewer RPCs/round-trips to the database. Simultaneous bulkloading via distcp into the data warehouse HBase instance means the data warehouse has fresh data.
The entity-centric HBase schema provides the ability to add new analytics/insights in a scalable manner. Data used to derive a personalized insight is stored in a single HBase row, providing data locality for scans and eliminating the HBase overhead of multi-row traversals and aggregations.
Secondary analytics were moved to the data warehouse, reducing the memory pressure and task contention on the service cluster. MR jobs on the service cluster are specific to the generation of personalized insights served at low latency.
The new architecture has worked, but there are still areas we want to improve, such as automation and ETL tooling that will make it easier to load new datasets and create new insights.
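The entity-centric schema can be pictured as follows (the qualifier names and percentage math are hypothetical): the raw usage and the derived insight live in one row, so serving a neighbor comparison is a single-row read.

```python
# Illustrative sketch of an entity-centric row: one HBase row per
# household carries both the raw usage history and the insights derived
# from it, so one row read serves the whole insight.

household_row = {
    "usage:2014-05": 310.0,   # monthly kWh readings
    "usage:2014-06": 295.0,
}

def write_neighbor_comparison(row, neighbor_avg_kwh):
    """Derive the neighbor-comparison insight and store it in the same row."""
    latest = row["usage:2014-06"]
    row["insight:vs_neighbors_pct"] = round(
        100.0 * latest / neighbor_avg_kwh - 100.0, 1
    )
    return row

write_neighbor_comparison(household_row, neighbor_avg_kwh=340.0)
print(household_row["insight:vs_neighbors_pct"])  # -13.2 -> 13.2% below neighbors
```

Adding a new insight then means adding a new qualifier to the row, rather than a new table, which is what makes the schema easy to extend.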
Transition: This new architecture enables two distinct environments for creating new data insights
Product calculations are built as producer-style MapReduce jobs, reading and writing to the same HBase row. For example, a trend in energy usage for the current bill period will be derived from the usage data present in the row and used to forecast the customer’s energy consumption and spending for the current period.
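A producer-style step like the bill-period forecast above might look like this in miniature (the field names and the linear extrapolation are illustrative assumptions):

```python
# Illustrative producer-style calculation: read the current-period usage
# out of an entity row, extrapolate the bill-period total, and write the
# forecast back into the same row.

def forecast_step(row, days_in_period=30):
    usage = row["usage_kwh"]          # readings so far this period
    daily = sum(usage) / len(usage)
    row["insight:forecast_kwh"] = daily * days_in_period
    return row

row = {"usage_kwh": [10.0, 11.0, 12.0]}   # 3 days into the period
forecast_step(row)
print(row["insight:forecast_kwh"])        # 330.0
```

Because each step only touches its own row, the job parallelizes cleanly across regions, which is what makes the producer pattern so amenable to MapReduce.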
Insights are accessed by a service query layer. A template HBase service container can be easily extended to create service APIs for different insight products. Service client applications are used by reporting pipelines and embedded web components.
Offline analysis and experimentation occur in the data warehouse. Hive, BI tools (Platfora, Datameer), and raw MapReduce jobs are used to create aggregate reports and non-product analytics such as customer program eligibility.
These tools are also used for ad-hoc analysis of full energy usage datasets, such as electric car charging trends or the impact of the Super Bowl on energy consumption.
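The kind of ad-hoc question above becomes a simple scan-and-aggregate once the full multi-tenant dataset sits in the warehouse (data shapes here are hypothetical; in practice this would be a Hive query or MR job):

```python
# Illustrative offline aggregation: compare average hourly usage on game
# day to an ordinary Sunday, over the full dataset, with no production
# MySQL slaves at risk.
from collections import defaultdict

readings = [
    # (household, date, hour, kwh)
    ("h1", "2014-02-02", 18, 1.9),  # Super Bowl Sunday
    ("h2", "2014-02-02", 18, 2.4),
    ("h1", "2014-01-26", 18, 1.1),  # ordinary Sunday baseline
    ("h2", "2014-01-26", 18, 1.3),
]

def avg_kwh_by_date(rows, hour):
    """Group readings for one hour of day by date and average them."""
    totals, counts = defaultdict(float), defaultdict(int)
    for _, date, h, kwh in rows:
        if h == hour:
            totals[date] += kwh
            counts[date] += 1
    return {d: round(totals[d] / counts[d], 2) for d in totals}

print(avg_kwh_by_date(readings, hour=18))
# {'2014-02-02': 2.15, '2014-01-26': 1.2}
```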
In the future we look to link the two systems, enabling analytics developed offline to be ‘promoted’ to product calculations.
Transition: What’s been the result of the switch to a Hadoop architecture?
Batch analytics calculated via the producer pattern are much more amenable to MapReduce parallelization and take advantage of HBase row locality. Run times dropped significantly. Some jobs could be made multi-tenant, which makes them easier to operate.
Individual insight query latency dropped from several seconds to ~10 ms. Our performance tests measure at the 99.999th percentile of the latency tail, so the average time is even faster. Query latency has been critical for SOA-model SLAs, since multiple external services access this data in real time.
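Measuring at a high percentile of the latency tail, rather than the average, is the point above; a minimal sketch (nearest-rank method, sample numbers invented):

```python
# Illustrative sketch: an SLA stated at a tail percentile catches the
# slow outliers that an average would hide.

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100.0 * len(ordered))) - 1)
    return ordered[rank]

latencies_ms = [8, 9, 10, 10, 11, 12, 9, 10, 95, 10]  # one slow outlier
print(percentile(latencies_ms, 50))   # 10 -- the typical query
print(percentile(latencies_ms, 99))   # 95 -- the tail the SLA cares about
```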
Analytic development time is faster, although it could still be improved. Development speedups came from adding a data warehouse cluster for development and experimentation, with more analyst-friendly tools like Hive and Scalding. The entity-based schema used in production is also more amenable to adding new data.
Transition: We’ve had some success but encountered challenges along the way. Here are some lessons we learned:
There are numerous experts in the HBase community, and chances are someone has already tried what you are trying to do
Cloudera support has been critical in helping with Hadoop challenges:
Cloudera Support
Issue resolution and escalation, e.g., JobTracker memory issues and configuration; escalation to the larger Hadoop community
Backport critical patches. HBase sequence ID and cell overwrite bugs
Tuning and configuration guidance: HBase, Sqoop
Apache HBase community
Community technical guidance: message boards, meetups, HBaseCon
New feature development: the HBase community is always open to new ideas for improvements
Roadmap planning: what relevant features will be released in upcoming versions of the software. For example, how would stripe compaction or new block cache implementations impact your architecture?
Refs:
HBASE-8521 (cell overwrites)
HBASE-6590 (HFile sequence IDs)
HBASE-10958 (blindspot)
Transition: Other lessons learned
With any moderately sized Hadoop cluster you will need infrastructure to collect logs, monitor processes, and analyze metrics. We have effectively used Cloudera Manager for this purpose. CM alerts on service process status changes and reports performance metrics like read latency and clock skew. Post-issue forensics happen via log file analysis. Custom charting lets you create dashboards to analyze your specific bottlenecks or recurring issues.
Upgrade often. Hadoop components are routinely patched, so be sure to upgrade and use Cloudera and the community to understand issues with your current releases. Always test your upgrades.
Back up data for safety. We use HBase snapshot exports, then distcp to a backup cluster. Cloudera Manager has a useful UI for managing distcp jobs.