In this webinar, we will discuss how Apache Hadoop works with your current infrastructure and how you can use data discovery and visualization tools to gain deeper insights from new data types stored in Hadoop and your existing data center investments.
Data Discovery, Visualization, and Apache Hadoop
1. Data Discovery, Visualization
and Apache Hadoop
An InformationWeek Webcast
Sponsored by
26. Q&A
Ted J. Wasserman
Product Manager
Tableau Software
John Kreisa
VP Strategic Marketing
Hortonworks
Lenny Liebmann
Contributing Editor
InformationWeek
27. Resources
To View This or Other Events On-Demand Please Visit:
http://www.informationweek.com/events
http://www.netseminar.com
For more information please visit:
http://hortonworks.com/products/hortonworks-sandbox/
Editor's Notes
For the visual thinkers out there, let's expand our mathematical model to show some concrete examples. ERP, SCM, CRM, and transactional web applications are classic examples of systems processing Transactions; the highly structured data in these systems is typically stored in SQL databases. Interactions are about how people and things interact with each other or with your business: web logs, user click streams, social interactions and feeds, and user-generated content are classic places to find Interaction data. Observational data tends to come from the "Internet of Things": sensors for heat, motion, and pressure, and the RFID and GPS chips inside mobile devices, ATMs, and even aircraft engines, are just some examples of "things" that output Observation data.
Most folks would agree that video is "big" data. The analysis of what's happening in that video (i.e., what you, me, and others are doing in it) may not be "big," but it is valuable and it does fit under our umbrella. Moreover, business data feeds and publicly available data sets are also big data, so we should not limit our thinking to just the data that flows through an organization. For example, the mortgage-related data you may have could benefit from being blended with external data found in Zillow. The government, through the Open Data Initiative, is making more and more data publicly available. One use case I find interesting is predictive policing, where state and local law enforcement apply analytics to crime databases and other publicly available data to help predict where and when pockets of crime might spring up. These proactive analytics efforts have yielded real reductions in crime!
Anyhow, this is what Big Data means to me; hopefully it makes sense to you. It is important to note that we think of big data beyond the traditional concepts of volume, velocity, and variety, and instead in terms of transactions, interactions, and observations. In reality, this IS the big data our customers are dealing with.
While overly simplistic, this graphic represents what we commonly see as a general data architecture: a set of data sources producing data; a set of data systems to capture and store that data, most typically a mix of RDBMSs and data warehouses; and a set of applications that leverage the data stored in those data systems. These could be packaged BI applications (Business Objects, Tableau, etc.), enterprise applications (e.g., SAP), or custom applications (e.g., custom web applications), ranging from ad-hoc reporting tools to mission-critical enterprise operations applications. Your environment is undoubtedly more complicated, but conceptually it is likely similar.
As the volume of data has exploded, we increasingly see organizations acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets). Instead, we increasingly see Hadoop, and HDP in particular, being introduced as a complement to the traditional approaches. It is not replacing the database; it is a complement, and as such it must integrate easily with existing tools and approaches. This means it must interoperate with existing applications such as Tableau, SAS, and Business Objects; with existing databases and data warehouses, for loading data to and from the warehouse; with the development tools used for building custom applications; and with the operational tools used for managing and monitoring. A sketch of that database-to-Hadoop handoff follows.
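To make that handoff concrete, here is a minimal Python sketch of the kind of offload job this architecture implies: rows are pulled from an existing data system and landed in HDFS as a flat file for downstream processing. The table name, connection details, endpoint, and paths are hypothetical placeholders, and in production this movement is usually handled by a purpose-built tool such as Apache Sqoop rather than hand-written code.

import csv
import io
import sqlite3  # stand-in for any JDBC/ODBC-reachable database or warehouse

from hdfs import InsecureClient  # WebHDFS client from the open source `hdfs` (HdfsCLI) package

# 1. Pull rows out of the existing data system (a local SQLite file here,
#    purely as a placeholder for your RDBMS or warehouse).
conn = sqlite3.connect("warehouse.db")
rows = conn.execute("SELECT id, ts, amount FROM transactions")  # hypothetical table

# 2. Serialize the rows to CSV in memory.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "ts", "amount"])
writer.writerows(rows)

# 3. Land the file in HDFS, where Hive, Pig, or MapReduce jobs can pick it up.
client = InsecureClient("http://namenode:50070", user="etl")  # hypothetical endpoint
client.write("/data/raw/transactions.csv", data=buf.getvalue(),
             encoding="utf-8", overwrite=True)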
It is for that reason that we focus on HDP interoperability across all of these categories. Data systems: HDP is endorsed by and embedded with SQL Server, Teradata, and more. BI tools: HDP is certified for use with the packaged applications you already use, from Microsoft to Tableau, MicroStrategy, Business Objects, and more. Development tools: for .NET developers, Visual Studio, used to build more than half the custom applications in the world, is certified with HDP so Microsoft application developers can build custom apps with Hadoop; for Java developers, Spring for Apache Hadoop makes it quick and easy to build Hadoop-based applications with HDP. Operational tools: integration with System Center and with Teradata Viewpoint.
Now that we've covered the overall architecture and how Hadoop fits, let's discuss the patterns of use we're seeing for Hadoop. At a high level, we describe the three key patterns as Refine, Explore, and Enrich. Refine captures data into the platform and transforms (or refines) it into the desired formats. Explore is about creating lakes of data that you can interactively surf through to find valuable insights. Enrich is about leveraging analytics and models to influence your online applications, making them more intelligent. So while some categorize Hadoop as just a batch platform, it is increasingly being used, and evolving, to serve a wide range of usage patterns that span batch, interactive, and online needs. Let me cover these patterns in a little more detail.
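Before going further, here is what the Refine pattern can look like in practice: a minimal Hadoop Streaming sketch in Python that turns raw web-log lines already sitting in HDFS into a refined page-view count. The log field position, file names, and HDFS paths are hypothetical stand-ins for your own data.

#!/usr/bin/env python
# mapper.py: emit "page<TAB>1" for each request in a raw access-log line
# read from stdin. The field index (6) assumes a common/combined log
# format; adjust it for your own logs.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 6:              # skip malformed lines
        print("%s\t1" % fields[6])   # fields[6] is the request path

#!/usr/bin/env python
# reducer.py: sum the per-page counts emitted by mapper.py. Hadoop
# Streaming delivers mapper output sorted by key, so a running total
# per key is sufficient.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, total))
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, total))

A job like this is typically launched with the hadoop-streaming jar that ships with the distribution, along the lines of: hadoop jar hadoop-streaming.jar -input /data/raw/logs -output /data/refined/pageviews -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the jar's exact path varies by install).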
In summary, by addressing these elements, we can provide an Enterprise Hadoop distribution which includes the Core Services, Platform Services, Data Services, and Operational Services required by the enterprise user. All of this is done in 100% open source and tested at scale by our team (together with our partner Yahoo) to bring enterprise process to an open source approach. And finally, this is the distribution that is endorsed by the ecosystem to ensure interoperability in your environment.
At Hortonworks today, our focus is very clear: we develop, distribute, and support a 100% open source distribution of Enterprise Apache Hadoop. We employ the core architects, builders, and operators of Apache Hadoop and drive the innovation in the open source community. We distribute the only 100% open source Enterprise Hadoop distribution: the Hortonworks Data Platform. Given our operational expertise running some of the largest Hadoop infrastructure in the world at Yahoo, our team is uniquely positioned to support you. Our approach is also uniquely endorsed by some of the biggest vendors in the IT market. Yahoo is an investor, a customer, and, most importantly, a development partner: we partner to develop Hadoop, and no distribution of HDP is released without first being tested on Yahoo's infrastructure, using the same regression suite they have used for years as they grew to run the largest production cluster in the world. Microsoft has partnered with Hortonworks to include HDP in HDP for Windows, HDInsight Server, and the HDInsight Service.