Big Data Analytics Webinar
1. Big Data Analytics: Profiling the Use of Analytic Platforms in User Organizations. Wayne Eckerson, Director of Research, Business Applications and Architecture Media Group, TechTarget
3. Why Big Data?
- Changing data types
- Technology advances
- Insourcing and outsourcing
- Developers discover data
4. Analytics against Big Data
- Patterns
- Real-time
- Complex calculations
- Sustainable advantage
5. Framework for Success (diagram): Culture, People, Organization, and Architecture, encompassing fact-based decisions, performance measurement, casual users, power users, IT professionals, business executives, an analytics center of excellence, the analytic platform, reporting, event-driven delivery, and data/BI governance.
6. Analytic Platforms. An analytic platform is a data management system optimized for query processing and analytics that provides superior price-performance and availability compared with general-purpose database management systems. Have you purchased or implemented an analytic platform as defined in this survey?
12. BI Delivery Framework 2020 (diagram): four intelligences around a universal information access layer. Business Intelligence (data warehousing, reports and dashboards, MAD design framework); Continuous Intelligence (event-driven alerts and dashboards, CEP/streams, event detection and correlation); Analytics Intelligence (analytic sandboxes; ad hoc query, spreadsheets, OLAP, visual analysis, analytic workbenches, Hadoop, Excel, Access, SAS); Content Intelligence (Hadoop, MapReduce, search, NoSQL, key-value pair indexes).
13. Top Down ("Business Intelligence") vs. Bottom Up ("Analytics Intelligence")

Top down: driven by corporate objectives and strategy; reporting and monitoring for casual users; non-volatile data; DW architecture; predefined metrics.
- Pros: alignment, consistency
- Cons: hard to build, hard to change, expensive, politically charged, "schema heavy"

Bottom up: driven by processes and projects; analysis and prediction for power users; volatile data; ad hoc queries; analytics architecture.
- Pros: quick to build, easy to change, low cost, politically uncharged
- Cons: alignment, consistency, "schema light"

Reports beget analysis; analysis begets reports.
14. BI Architecture - 2020 (diagram): the top-down architecture runs from operational systems (structured data) through extract, transform, load (batch, near-real-time, or real-time) into the data warehouse, departmental data marts, and a BI server delivering reports, dashboards, and alerts (via a streaming/CEP engine) to casual users. The bottom-up architecture feeds machine data, web data, audio/video data, documents and text, and external data through a Hadoop cluster into analytic sandboxes (virtual sandboxes in the DW, an in-memory BI sandbox, and a free-standing sandbox on an analytical platform or non-relational database) that power users reach via ad hoc query, upload and query, and query and report.
15. BI Architecture - 2020 (same diagram as slide 14, with the five analytic sandboxes highlighted).
16. Recommendations
- Harmonize top-down and bottom-up BI
- Implement a BI architecture that supports multiple intelligences
- Create multiple types of analytic sandboxes
- Implement analytic platforms that meet business and technical requirements
Editor's notes
Welcome to this webcast on big data analytics. My name is Wayne Eckerson, a long-time industry analyst and thought leader in the business intelligence market, and I will be your speaker today.

One housekeeping item before we begin: this is a prerecorded webcast, so there will be no Q&A session at the end. If you have questions for me, please don't hesitate to send me an email at weckerson@gmail.com. I'd be happy to dialogue with you about this important topic!

The research and findings that I will present in this webcast are based on a report that you can download for free from the BeyeNetwork website or from Bitpipe. It's a 40-page report, so I hope you take the time to peruse its details. This 60-minute webcast will present highlights from that report.

First, I'll talk about the big data analytics movement: what's behind it, what it is, and best practices for doing it. Second, I'll talk about big data analytics engines. I'll explain the technology most of these engines use to turbo-charge analytical queries and then catalog vendors in the space. Third, I'll lump analytic engines into four categories and present survey results that show what causes customers to buy each category of product. Finally, and perhaps most importantly, I'll describe a framework for implementing big data analytics and show how to expand your existing business intelligence and data warehousing architecture to handle new requirements.

So with that, let's begin.
I’d like to thank our sponsors who made the research and this webcast possible.
There has been a lot of talk about "big data" in the past year, which I find a bit puzzling. I've been in the data warehousing field for more than 15 years, and data warehousing has always been about big data. So what's new in 2011? Why are we talking about "big data" today? There are several reasons:

Changing data types. Organizations are capturing different types of data today. Until about five years ago, most data was transactional in nature, consisting of numeric data that fit easily into the rows and columns of relational databases. Today, the growth in data is fueled largely by unstructured data from websites as well as machine-generated data from an exploding number of sensors.

Technology advances. Hardware has finally caught up with software. The exponential gains in price/performance exhibited by computer processors, memory, and disk storage have finally made it possible to store and analyze large volumes of data at an affordable price. Organizations are storing and analyzing more data because they can!

Insourcing and outsourcing. Because of the complexity and cost of storing and analyzing web traffic data, most organizations have outsourced these functions to third-party service bureaus. But as the size and importance of corporate e-commerce channels have increased, many are now eager to insource this data to gain greater insights about customers. At the same time, virtualization technology is making it attractive for organizations to move large-scale data processing to private hosted networks or public clouds.

Developers discover data. The biggest reason for the popularity of the term "big data" is that web and application developers have discovered the value of building new data-intensive applications. To application developers, "big data" is new and exciting.
Of course, for those of us who have made our careers in the data world, the new era of "big data" is simply another step in the evolution of data management systems that support reporting and analysis applications.
Big data by itself, regardless of the type, is worthless unless business users do something with it that delivers value to their organizations. That's where analytics comes in. Although organizations have always run reports against data warehouses, most haven't opened these repositories to ad hoc exploration. This is partly because analysis tools are too complex for the average user, but also because the repositories often don't contain all the data needed by the power user. But this is changing.

Patterns. A valuable characteristic of "big data" is that it contains more patterns and interesting anomalies than "small" data. Thus, organizations can gain greater value by mining large data volumes than small ones. Fortunately, techniques already exist to mine big data thanks to companies, such as SAS Institute and SPSS (now part of IBM), that ship analytical workbenches.

Real-time. Organizations that accumulate big data quickly recognize that they need to change the way they capture, transform, and move data from a nightly batch process to a continuous process using micro-batch loads or event-driven updates. This technical constraint pays big business dividends because it makes it possible to deliver critical information to users in near real time.

Complex analytics. In addition, during the past 15 years, the "analytical IQ" of many organizations has evolved from reporting and dashboarding to lightweight analysis. Many are now on the verge of upping their analytical IQ by implementing predictive analytics against both structured and unstructured data. This type of analytics can be used to do everything from delivering highly tailored cross-sell recommendations to predicting failure rates of aircraft engines.

Sustainable advantage.
At the same time, executives have recognized the power of analytics to deliver a competitive advantage, thanks to the pioneering work of thought leaders, such as Tom Davenport, who co-wrote the book "Competing on Analytics." In fact, forward-thinking executives recognize that analytics may be the only true source of sustainable advantage, since it empowers employees at all levels of an organization with information to help them make smarter decisions.
However, the road to big data analytics is not easy, and success is not guaranteed. Analytical champions are still rare. That's because succeeding with big data analytics requires the right culture, people, organization, architecture, and technology.

The right culture. Analytical organizations are championed by executives who believe in making fact-based decisions or validating intuition with data. These executives create a culture of performance measurement in which individuals and groups are held accountable for the outcomes of predefined metrics aligned with strategic objectives.

The right people. You can't do big data analytics without power users, or more specifically, business analysts, analytical modelers, and data scientists. These folks possess a rare combination of skills and knowledge: they have a deep understanding of business processes and the data that sits behind those processes, and they are skillful in the use of various analytical tools, including Excel, SQL, analytical workbenches, and coding languages.

The right organization. Historically, analysts with the aforementioned skills were pooled in pockets of an organization, hired by department heads. But analytical champions create a shared-service organization (i.e., an analytical center of excellence) that makes analytics a pervasive competence. Analysts are still assigned to specific departments and processes, but they are also part of a central organization that provides collaboration, camaraderie, and a career path.

Analytic platform. At the heart of an analytical infrastructure is an analytic platform, the underlying data management system that consumes, integrates, and provides user access to information for reporting and analysis activities. Today, many vendors, including most sponsors of this webinar, provide specialized analytic platforms that deliver dramatically better query performance than existing systems. There are many different types of analytic platforms sold by dozens of vendors.
So what is an analytic platform? It's a data management system optimized for query processing and analytics that provides superior price-performance and availability compared with general-purpose database management systems. Given this definition, 72% of our survey respondents said they already have an analytic platform. This is a surprisingly high percentage given that these platforms, except for Teradata and Sybase IQ, have only been generally available for the past five years or so. In looking at the survey responses, I did see a lot of Microsoft customers who think SQL Server fits this definition, which it doesn't. Nevertheless, I think the results speak volumes for the power of these analytic platforms to optimize the performance of analytical applications.
Analytic platforms offer superior price-performance for many reasons. And while product architectures vary considerably, most support the following characteristics:

Massively parallel processing (MPP). Most analytic platforms spread data across multiple nodes, each containing its own CPU, memory, and storage and connected to a high-speed backplane. When a user submits a query or runs an application, the "shared nothing" system divides the work across the nodes, each of which processes the query on its piece of the data and ships the results to a master node that assembles the final result and sends it to the user. MPP systems are highly scalable, since you simply add nodes to increase processing power.

Balanced configurations. Analytic platforms optimize the configuration of CPU, memory, and disk for query processing rather than transaction processing. Analytic appliances essentially "hard wire" this configuration into the system and don't let customers change it, whereas analytic bundles or analytic databases (i.e., software-only solutions) allow customers to configure the underlying hardware to match unique application requirements.

Storage-level processing. Netezza's big innovation was to move some database functions, specifically data filtering functions, into the storage system using field-programmable gate arrays. This storage-level filtering reduces the amount of data that the DBMS has to process, which significantly increases query performance. Many vendors have followed suit, moving various database functions into hardware.

Columnar storage and compression. Many vendors have followed the lead of Sybase, Sand Technology, ParAccel, and other columnar pioneers by storing data in columns, not rows. Since most queries ask for a subset of the columns in a row rather than all of them, storing data in columns minimizes the amount of data that needs to be retrieved from disk and processed by the database, accelerating query performance.
In addition, since data elements in many columns are repeated (e.g., "male" and "female" in the gender field), column-store systems can eliminate duplicates and compress data volumes significantly, sometimes as much as 10:1. This enables more data to fit into memory, which speeds processing.

Memory. Many analytic platforms make liberal use of memory caches to speed query processing. Some products, such as SAP HANA and QlikTech's QlikView, store all data in memory, while others store recently queried results in a smart cache so others who need to retrieve the same data can pull it from memory rather than from disk. Given the growing affordability of memory and the widespread deployment of 64-bit operating systems, many analytic platforms are expanding their memory footprints to speed processing.

Query optimizer. Analytic platform vendors invest a lot of time and money researching ways to enhance their query optimizers to handle various workloads. A good query optimizer is the biggest contributor to query performance. In this respect, the older vendors with established products have an edge.

Plug-in analytics. True to their name, many analytic platforms offer built-in support for complex analytics. This includes complex SQL, such as correlated subqueries, as well as procedural code implemented as plug-ins to the database. Some vendors offer a library of analytical routines, from fuzzy-matching algorithms to market-basket calculations. Some, like Aster Data (now owned by Teradata), provide native support for MapReduce programs that are called using SQL.
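The MapReduce pattern just mentioned can be sketched in miniature. This toy Python example (the regions, amounts, and SUM-by-region query are illustrative, not any vendor's API) shows the map, shuffle, and reduce steps that an MPP or Hadoop engine would parallelize across nodes:

```python
# Toy MapReduce-style aggregation: map each partition of rows to
# (key, value) pairs, shuffle them together, then reduce per key.
from collections import defaultdict

def map_phase(partition):
    # Emit (region, amount) pairs for each sale row in this partition.
    return [(row["region"], row["amount"]) for row in partition]

def reduce_phase(pairs):
    # Group pairs by key and aggregate (SUM) each group.
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

# Two "nodes", each holding one horizontal partition of the sales data.
partitions = [
    [{"region": "east", "amount": 10}, {"region": "west", "amount": 5}],
    [{"region": "east", "amount": 7}],
]

# Shuffle: concatenate the mapper outputs before reducing.
pairs = [pair for part in partitions for pair in map_phase(part)]
print(reduce_phase(pairs))  # {'east': 17, 'west': 5}
```

In a real cluster each map_phase call runs on the node that owns the partition, and the shuffle moves pairs over the network, but the division of labor is the same.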
MPP analytic databases: row-based databases designed to scale out on a cluster of commodity servers and run complex queries in parallel against large volumes of data.

Columnar databases: database management systems that store data in columns, not rows, and support high data compression ratios.

Analytic appliances: preconfigured hardware-software systems designed for query processing and analytics that require little tuning.

Analytic bundles: predefined hardware and software configurations that are certified to meet specific performance criteria, but that customers must purchase and configure themselves.

In-memory databases: systems that load data into memory to execute complex queries.

Distributed file-based systems: designed for storing, indexing, manipulating, and querying large volumes of unstructured and semi-structured data.

Analytic services: analytic platforms delivered as a hosted or public-cloud-based service.

Nonrelational databases: optimized for querying unstructured data as well as structured data.

CEP/streaming engines: ingest, filter, calculate, and correlate large volumes of discrete events and apply rules that trigger alerts when conditions are met.
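The CEP/streaming entry above can be illustrated with a minimal rule engine. This Python sketch (the event shape and the failed-login rule are hypothetical, chosen only to show the ingest-correlate-alert loop) fires an alert when a condition is met:

```python
# Minimal CEP-style sketch: ingest discrete events, correlate them per
# entity (account), and trigger an alert when a rule's condition holds.
from collections import defaultdict

def detect_alerts(events, threshold=3):
    # Rule: alert when the same account accumulates `threshold`
    # consecutive failed logins; a successful login resets the count.
    failures = defaultdict(int)
    alerts = []
    for event in events:
        if event["type"] == "login_failed":
            failures[event["account"]] += 1
            if failures[event["account"]] == threshold:
                alerts.append(f"alert: {event['account']} hit {threshold} failed logins")
        elif event["type"] == "login_ok":
            failures[event["account"]] = 0
    return alerts

stream = [
    {"account": "a1", "type": "login_failed"},
    {"account": "a1", "type": "login_failed"},
    {"account": "a2", "type": "login_ok"},
    {"account": "a1", "type": "login_failed"},
]
print(detect_alerts(stream))  # ['alert: a1 hit 3 failed logins']
```

A production streaming engine adds time windows, out-of-order handling, and scale, but the core loop of filtering, correlating, and rule evaluation looks like this.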
Our survey grouped analytic platforms into four major categories to make it easier to compare and contrast the various product offerings:

Analytic databases are software-only analytic platforms that run on a variety of hardware that customers purchase. Customers install, configure, and tune the software, including the analytic database, before they can use the analytic system. Most MPP analytic databases, columnar databases, and in-memory databases qualify as analytic databases. As a rule of thumb, analytic databases are good for organizations that want to tune database performance for specific workloads or run the RDBMS software on a virtualized private cloud.

Analytic appliances: These are hardware-software combinations designed to support ad hoc queries and other types of analytic processing. This category includes both analytic appliances and analytic bundles. As a rule of thumb, analytic appliances are fast to deploy and easy to maintain, and they make good replacements for Microsoft SQL Server or Oracle data warehouses that have run out of gas. They also make great standalone data marts to offload complex queries from large, maxed-out data warehousing hubs.

Analytic services: Rather than deploy an analytic platform in a customer's data center, an analytic service enables customers to house the system in an off-site hosted environment or public cloud. As a rule of thumb, analytic services are great for development, test, and prototyping applications as well as for organizations that don't have an IT department, want to outsource data center operations, or need to get up and running very quickly.

File-based analytic systems: This generally refers to Hadoop, but we also lumped NoSQL or nonrelational systems into this category. That's not entirely accurate, since nonrelational systems are databases; however, since both are used to store and analyze large volumes of unstructured data and don't require an up-front schema design, they share more similarities than differences.
As a rule of thumb, this category of products is ideal for processing large volumes of web traffic and other log-based or machine-generated data.
When examining the business requirements driving purchases of analytic platforms overall, three percolate to the top: "faster queries," "storing more data," and "reduced costs." These requirements are followed by "more complex queries," "higher availability," and "quicker to deploy." This ranking is based on summing the percentages of all four deployment options for each requirement.

More important, this chart shows that customers purchase each deployment option for different reasons. Analytic database customers value "quick to deploy" (46%), "built-in analytics" (43%), and "easier maintenance" (41%) more than other requirements, while analytic service customers favor "storing more data" (67%), "high availability" (67%), "reduced costs" (56%), and "more concurrent users" (56%). Not surprisingly, customers with file-based systems look for the ability to support "more diverse data" (64%) and "more flexible schemas" (64%), two hallmarks of a Hadoop/NoSQL offering. Analytic appliance customers had the most emphatic requirements: roughly two-thirds or more value faster queries (70%), more complex queries (64%), and faster load times (63%), suggesting that analytic appliance customers seek to offload complex ad hoc queries from data warehouses.
We also asked respondents if they were looking for a specific deployment option when evaluating products (see Figure 14). Except for customers of file-based systems, most customers investigated products across these four categories. For example, Blue Cross Blue Shield of Kansas City looked at three columnar databases (i.e., software-only) and an appliance before making a decision. Interestingly, no analytic service customers intended to subscribe to a service prior to evaluating products. That’s because many analytic-service customers subscribe to such services on a temporary basis, either to test or prototype a system or to wait until the IT department readies the hardware to house the system. Some of these customers continue with the services, recognizing that they provide a more cost-effective test and development environment than an in-house system.
Now that we've discussed the engines that drive big data analytics, let's step back a bit and look at the overall framework in which they operate. I introduced this BI Delivery Framework 2020 in March. It's basically my vision for what BI environments will look like in about 10 years. Instead of the single BI architecture to support reporting and analysis applications depicted in the middle, there will be four intelligences. Let me briefly describe each.

Business Intelligence represents a classic data warehousing environment that delivers reports and dashboards primarily to casual users via a MAD framework. MAD stands for….

Moving to the right, Continuous Intelligence delivers near-real-time information and alerts to operational workers using event-driven architectures that handle simple and complex events.

At the bottom, Analytics Intelligence enables power users to submit ad hoc queries against any data source using a variety of tools, ideally supported by analytic sandboxes built into the top-down environment.

To the left, Content Intelligence makes unstructured data an equal target for reporting and analysis applications. These systems use a variety of indexing technologies to store both structured and unstructured data and allow users to submit queries against them. This is a fast-growing area that encompasses Hadoop, NoSQL, and search-based technologies.

If you want more information on this framework, please download my first report, titled Analytic Architectures: BI Delivery Framework 2020, from BeyeNetwork's website. But before leaving the framework, I want to drill down on Business Intelligence and Analytics Intelligence, the two most interrelated intelligences in this framework, and the two most problematic to manage synergistically.
This is another depiction of the two intelligences. As I already mentioned, Business Intelligence is a top-down environment that delivers reports and dashboards to casual users. The output is based on predefined metrics aligned with strategic goals and objectives. In other words, in a top-down environment, you know in advance what questions users want to ask, and you model the environment accordingly. The benefits of this environment are that it delivers information consistency and alignment, the proverbial single version of truth. The downsides are that it's hard to build, hard to change, costly, and politically charged.

In contrast, Analytics Intelligence is the opposite. It's a bottom-up environment geared to power users who submit ad hoc queries against a variety of sources to optimize processes and projects. This ad hoc environment is quick to build, easy to change, low cost, and politically uncharged. Yet it creates myriad analytic silos, and thus it forfeits information alignment and consistency.

The problem here is that most companies try to do all BI in either a top-down or a bottom-up environment. They may start with top-down and get discouraged that it's expensive and not geared to ad hoc types of requests. So they abandon it in favor of analytics intelligence, which works fine for a while, until they realize they are overwhelmed with analytic silos and don't have a common understanding of business performance.

The first key here is to recognize that you need both top-down and bottom-up environments. They are synergistic. Analysis begets reports, and reports beget analysis. You do some analysis, find something interesting, and turn it into a regularly scheduled report for others to see. But that report should trigger additional questions, which call for additional analysis, and so on.

The second key is to apply the right architecture to the right tasks. Typically, top-down environments address 80% of your information requirements and bottom-up 20%.
Yet, bottom-up may uncover 80% of your most valuable insights. Both are equally important and must be treated equivalently when building your corporate information architecture.
So here's the architecture behind the BI Delivery Framework 2020. Let me step you through it.

What's pictured below is the classic top-down business intelligence and data warehousing environment that most organizations have already built. What's pictured in pink are new components that address the other three intelligences.

To the left are new sources of data that aren't typically loaded into data warehouses: machine-generated data, web data, unstructured data, and external data. In front of these sources is a Hadoop cluster, which is ideal for batch processing large volumes of unstructured and semi-structured data, although it can also manage structured data.

Atop the DW is the streaming/complex event processing engine for handling continuous intelligence and alerting.

Below the DW is a free-standing database or sandbox that offloads bottom-up analytic processing from the DW, if desired.

To the right and bottom is the power user, who traditionally has been left out of the classic BI/DW architecture. Now they have access to five types of analytic sandboxes designed to support ad hoc query processing, as well as external data if they have permission.
This is the same BI architecture but with the five sandboxes highlighted in green.

A virtual sandbox inside the DW is a set of dedicated partitions into which analysts can upload their own data and mix it with corporate data.

To avoid contention for DW resources, many companies create a free-standing sandbox to house data or users the DW can't support. Basically, this option offloads complex processing to a separate machine.

A local in-memory BI tool can also serve as a sandbox, as long as it requires analysts to publish their findings to an IT-managed server rather than proliferate spreadmarts.

Hadoop is a sandbox because it allows power users who know the atomic data well and can write code to submit queries against large volumes of unstructured or structured data.

Like Hadoop, a DW can be a sandbox for those power users whom IT trusts to write well-designed SQL that doesn't bog down performance for others.
So, to wrap up, I have four recommendations for supporting big data analytics:

#1. Harmonize top-down and bottom-up BI. For too long, organizations have tried to shoehorn all types of users into a single information architecture. That has never worked. Organizations need to recognize that casual users need top-down, interactive reports and dashboards, while power users need ad hoc exploratory tools and environments.

#2. Implement a BI architecture that supports multiple intelligences. The BI architecture of the future supports both traditional data warehousing to handle detailed transactional data and file-based and nonrelational systems to handle unstructured and semi-structured data. It also supports continuous intelligence through CEP and streaming engines, and analytic sandboxes for ad hoc exploration.

#3. Create multiple types of analytic sandboxes. Analytic sandboxes bring power users more fully into the corporate data environment by enabling them to mix personal and corporate data and run complex, ad hoc queries with minimal restrictions.

#4. Implement analytic platforms that meet business and technical requirements. There are four broad types of analytic platforms. Pick the one that is right for you. Appliances are quick to deploy and easy to maintain; analytic databases provide the flexibility to run the software on the hardware of your choice; analytic services forego the time and cost of provisioning software in your own data center, if you have one; and file-based systems are ideal for processing unstructured and semi-structured data.