A close look, from the Oracle NoSQL Database product management group, at the challenges of web-scale personalization workloads: how NoSQL technology is enabling this class of application, and how the inability to meet the demands of these emerging workloads can impact the business financially.
Definition of personalization: Wikipedia, "Personalization" (http://en.wikipedia.org/wiki/Personalization), retrieved 15 Sept 2013. See also: Doman, James, "What is the definition of 'personalization'?" (http://www.quora.com/What-is-the-definition-of-personalization), Quora, retrieved 19 March 2012.
Retail, especially brick-and-mortar retail, used to be very simple. A customer walked in, found what they wanted, paid for it and walked out. There was typically almost no interaction with the customer, and certainly no personalized experience: a very simple, limited set of steps.
The store barely interacted with the customer unless they asked for help. The only record of the customer ever visiting the store was a sales receipt. Most purchases consisted of a few items. There was no opportunity to recommend other products, customize the experience, or learn how the customer felt about their purchase.
With the web, everything changes. A customer's actions can be captured, and navigation and content can be tailored to that customer, providing a personalized experience. Customers can record comments, suggestions, reviews and more. Every visit is an opportunity to learn more about the customer and guide their shopping experience.
Purchases over the web can involve hundreds of steps, with a wealth of personalized data, dynamic content and navigation. Web sites can capture information specific to this customer: their product ratings, comments, shipping and packaging preferences, their lists. It is a much richer environment: how long did they stay, where did they look, what did they compare, how hard was it to get to the product? All of this information tells the store how it is doing, and can be reused the next time the customer visits. Yes, you still need to capture the contents of the shopping cart, but that is only one aspect, albeit an important one, of the web retail experience.
How do the customer and the merchandiser interact over the web? Via personalization. What does that mean? Every step of the experience is personal and explicit for that customer. A few of the common interactions that people simply expect:
- A personalized greeting
- Product recommendations (based on history, market segments, friends, product trends, etc.)
- Product comments and ratings
- Remembered lists: birthdays, anniversaries, special events, personal "I wanna" lists, etc.
- Notifications and reminders of upcoming events and past experiences
- Remembered shipping and payment information, so checkout is simple for me
Everyone has experienced this in one form or another on the web. The more these applications know, the more they can personalize the shopping experience for me. Every web page that I visit is an opportunity to provide personalized content.
All of this personalized interaction is based on a few simple concepts:
- Rich customer profiles. Each customer is represented by a rich profile with information specific to that customer: past history and recommendations for the future. These profiles are not static; they evolve over time, capturing new types of information, new recommendations, and new details that can be leveraged to further personalize the experience.
- Low-latency access to relevant data. A web page is a wealth of dynamic content, generated on the fly by hundreds of individual queries. These queries need to return with ultra-low latency, because people will not wait for web pages. (Amazon: a 10 ms delay means a 1% loss of revenue.) Additionally, not every web page (i.e., query) needs all of the information in a given customer profile; it needs only the information that is relevant.
- Scalability. Data only grows: catalogs, product information, customer profiles, historic data, customer ratings and comments, etc. Repositories need to be able to scale as the amount of data and processing increases.
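To make the "fetch only what is relevant" idea concrete, here is a minimal sketch in plain Python (not any real product's API; the store class, customer ID and field names are all hypothetical) of a profile kept as key-value pairs, so a page query can read just the fields it needs instead of the whole profile:

```python
class ProfileStore:
    """Toy in-memory key-value store: keys are (customer_id, field) pairs."""

    def __init__(self):
        self._data = {}

    def put(self, customer_id, field, value):
        self._data[(customer_id, field)] = value

    def get_fields(self, customer_id, fields):
        # Fetch only the fields relevant to this page -- not the full profile.
        return {f: self._data[(customer_id, f)]
                for f in fields if (customer_id, f) in self._data}

store = ProfileStore()
store.put("cust-42", "greeting_name", "Pat")
store.put("cust-42", "wish_list", ["hiking boots", "rain jacket"])
store.put("cust-42", "shipping_pref", "2-day")
store.put("cust-42", "purchase_history", ["tent", "stove"])

# A greeting banner needs only two fields, so only two keys are read.
banner = store.get_fields("cust-42", ["greeting_name", "shipping_pref"])
```

The point of the sketch is the access pattern: each query touches only the key-value pairs it needs, which is what keeps per-page latency low as profiles grow richer.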
But think about it: this is not just about retail. This applies to any internet-based interaction where there is an opportunity to deliver a customized, personalized experience. That includes activities like online gaming and gambling, travel, and so on. It includes interactions with people AND devices. Knowing the location, status and history of a device allows the application or service to provide more personalized or relevant content. For example, whenever I travel I look at two or three travel web sites. Why? Because I want to (a) compare prices, (b) get product ratings, and (c) get personalized recommendations. With cell phones, knowing the location AND direction of travel allows a service to send personalized, relevant updates directly to my phone. If I'm on the east side of the city, I want to hear about promotions or traffic problems where I am now.
So, what's the problem? The problem is that capturing, leveraging and evolving this data is complex. The data itself is simple; managing it can be a challenge. Doing it scalably and cost-effectively is hard. We've seen what happens when scalability or cost of operations is not considered: services become unresponsive or too expensive to maintain. Doing it in real time or near real time, without lengthy delays in processing, is harder still. For example: I may be male, live in the Northwest and be over 50 years old. Does that mean I want to purchase Birkenstocks and classic rock and roll? Not really. I'm much more than my market segment. In fact, I don't own Birkenstocks and I prefer folk music. For me to use your web site, you have to do better than put me in a general box. You need to know what I like. But I'm not your only customer; you have millions of customers just like me. And I'm not willing to wait 10 seconds for a web page on your site. In fact, I'm not likely to wait more than a second or two. Welcome to the "real time, I want it now" generation.
How do all of these business requirements for personalized, customized web experiences translate into technology requirements? First, you need to integrate with existing enterprise systems: ERP, CRM, data warehouses, analytics, etc. Second, you need low-latency operations and transactions. Third, because information evolves, you also need the flexibility to change the data and the applications, and the scalability to grow as the needs of the business change. That flexibility also has to be cost-effective in order to provide real business value.
If we look at the various storage options available to handle Big Data, there are essentially three types: Hadoop, NoSQL databases, and relational databases.
HDFS is a great distributed file system: parallel, highly scalable, with no inherent structure. However, it is tuned primarily for bulk sequential reads and writes of file blocks. There are no indices for fast access to specific data records, and it is not well suited to lots of small files or to updating files that have already been written. It is primarily a batch system: write lots of data, then read it all in parallel, over and over. That sounds like a data warehouse, but more unstructured.
The relational database, on the other hand, is usually deployed on a big machine and supports complex data structures stored in tables with plenty of relationships. Data is manipulated and accessed using rich SQL to build mission-critical applications. There is support for a variety of data access protocols like ODBC/JDBC, along with an elaborate life-cycle management infrastructure covering security and backup/restore operations. Enterprises run their mission-critical transaction processing systems on relational databases.
The NoSQL database is the middle ground: a distributed key-value database with a simple data structure. It has indices. It can handle large volumes of data and is usually deployed on a distributed architecture consisting of several small machines. It is designed for low-latency, high-volume reads and writes of simple data, typical of real-time and web-scale applications. It is not tuned for reading and writing huge files; use a file system for that. It has flexible configuration capabilities that make it very suitable for rapid application development. In short: data scalability at low cost.
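The "it has indices" distinction is the operational heart of the comparison. A toy illustration in plain Python (not any real product's API; the record layout is made up) of why an indexed key-value lookup beats a batch-style scan for fetching one record:

```python
# 100,000 toy records, as a batch file store might hold them: one big
# sequence with no index over individual records.
records = [{"key": f"user-{i}", "visits": i % 7} for i in range(100_000)]

def scan_lookup(key):
    # HDFS-style access: no index, so finding one record means
    # reading through the data sequentially.
    for rec in records:
        if rec["key"] == key:
            return rec
    return None

# Key-value-style access: an index maps each key straight to its record,
# so a single-record read is a direct O(1) lookup.
index = {rec["key"]: rec for rec in records}

def indexed_lookup(key):
    return index.get(key)

# Both return the same record; only the cost of finding it differs.
assert scan_lookup("user-99999") == indexed_lookup("user-99999")
```

The scan is fine when you intend to read everything anyway (the data-warehouse pattern); the index is what makes thousands of tiny per-page reads affordable.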
One of the important features of Oracle NoSQL Database is that it supports transactions. Why are database transactions important in web-scale transaction-processing and personalization applications? Because "flaky" or "inconsistent" application behavior will drive people elsewhere. Think about it: if you visit a web site that sometimes works and sometimes doesn't, you're likely to go elsewhere. I would. For example, say there's one item left in inventory and two online shoppers both put it in their carts. That's fine, because no one has purchased the item yet; however, this condition needs to be tracked and resolved at some point. That can be a challenge, especially in a globally distributed web application. In most NoSQL database products, the developer has to put special code in the application to handle it. Oracle NoSQL Database allows the developer to let the database handle the transactional consistency. Another example is the purchase of a shopping cart full of items. It is not acceptable for some, but not all, of the items in the cart to be successfully processed. Application developers should be able to rely on the database, in this case Oracle NoSQL Database, to enforce the proper transactional behavior, where all or none of the items are purchased.
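The two examples above can be sketched in plain Python. This is not Oracle NoSQL Database's actual API, and the inventory and item names are made up; it is a single-process sketch of the check-then-apply discipline that, in a real distributed store, the database's conditional writes and transactions would enforce for you:

```python
class InventoryError(Exception):
    pass

# One camp stove left -- the contested "last item" from the example.
inventory = {"camp-stove": 1, "lantern": 3}

def purchase_cart(items):
    """Atomically decrement stock for every item in the cart, or change nothing."""
    # First verify the whole cart against current stock...
    for item, qty in items.items():
        if inventory.get(item, 0) < qty:
            raise InventoryError(f"not enough {item} in stock")
    # ...and only after every check passes, apply the writes.
    # All items succeed together, or none do.
    for item, qty in items.items():
        inventory[item] -= qty

purchase_cart({"camp-stove": 1, "lantern": 1})   # first shopper succeeds

try:
    # The second shopper also had the last camp stove in their cart.
    # The whole cart fails, and the lantern stock is left untouched.
    purchase_cart({"camp-stove": 1, "lantern": 1})
except InventoryError:
    pass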
Oracle NoSQL Database provides both scalability (the ability to increase the size and throughput of a cluster) AND predictable latency. We conducted this benchmark last year, working directly with our technology partners Intel and Cisco. This graph summarizes the results of running YCSB (the Yahoo! Cloud Serving Benchmark) on Oracle NoSQL Database over a set of increasingly large clusters. The cluster started at 6 storage nodes (2 shards, or partitions, with 3 replicas each) and grew to 12, 24 and 30 storage nodes, running on Intel Xeon E5-2690s with a 95% read, 5% update workload. As you can see, as we added hardware (storage nodes) to the system, we got a linear increase in throughput while still maintaining very low latency. Adding hardware increases throughput and capacity without adding significant latency to operations. At the end of the day, why is this important? Because it shows that (a) Oracle NoSQL Database can grow as your business, storage and processing needs grow, and (b) increasing your hardware delivers the results you would expect: more throughput without increased latency. Incidentally, this is more throughput than most companies need, on a relatively small cluster. For example, Twitter handles roughly 150K API calls per second.