Combining Hadoop with Big Data Analytics
Page 2 of 11
Sponsored by
Combining Hadoop with Big Data Analytics

Contents
Hadoop for Big Data Puts Architects on Journey of Discovery
'Big Data' Analytics Programs Require Tech Savvy, Business Know-How

This e-guide explores the growing popularity of leveraging Hadoop, an open source software framework, to tackle the challenges of big data analytics, as well as some of the key business and technical barriers to adoption. Gain expert advice on the skill sets and roles needed for successful big data analytics programs, and discover how organizations can design and assemble the right team.

Hadoop for Big Data Puts Architects on Journey of Discovery
By: Jessica Twentyman, Contributor

"Problems don't care how you solve them," said James Kobielus, an analyst with Forrester Research, in a recent blog on the topic of 'big data'. "The only thing that matters is that you do indeed solve them, using any tools or approaches at your disposal."

In recent years, that imperative -- to solve the problems posed by big data -- has led data architects at many organizations on a journey of discovery. Put simply, the traditional database and business intelligence tools that they routinely use to slice and dice corporate data are just not up to the task of handling big data.

To put the challenge into perspective, think back a decade: few enterprise data warehouses grew as large as a terabyte. By 2009, Forrester analysts were reporting that over two-thirds of enterprise data warehouses (EDWs) deployed were in the 1- to 10-terabyte range. By 2015, they claim, a majority of EDWs in large organizations will be 100 TB or larger, "with petabyte-scale EDWs becoming well entrenched in sectors such as telecoms, financial services and web commerce."

What's needed to analyze these big data stores are special new tools and approaches that deliver "extremely scalable analytics," said Kobielus. By "extremely scalable," he means analytics that can deal with data that stands
out in four key ways: for its volume (from hundreds of terabytes to petabytes and beyond); its velocity (up to and including real-time, sub-second delivery); its variety (diverse structured, unstructured and semi-structured formats); and its volatility (wherein scores to hundreds of new data sources come and go from new applications, services, social networks and so on).

Hadoop for Big Data on the Rise

One of the principal approaches to emerge in recent years has been Apache Hadoop, an open source software framework that supports data-intensive distributed analytics involving thousands of nodes and petabytes of data. The underlying technology was originally invented at search engine giant Google by in-house developers looking for a way to usefully index textual data and other "rich" information and present it back to users in meaningful ways. They called this technology MapReduce -- and today's Hadoop is an open-source implementation of that approach, used by data architects for analytics that are deep and computationally intensive.

In a recent survey of enterprise Hadoop users conducted by Ventana Research on behalf of Hadoop specialist Karmasphere, 94% of Hadoop users reported that they can now perform analyses on large volumes of data that weren't possible before; 88% said that they analyze data in greater detail; and 82% said that they can now retain more of their data.

It's still a niche technology, but Hadoop's profile received a serious boost over the past year, thanks in part to start-up companies such as Cloudera and MapR that offer commercially licensed and supported distributions of Hadoop. Its growing popularity is also the result of serious interest shown by EDW vendors like EMC, IBM and Teradata.
EMC bought data warehousing specialist Greenplum in June 2010; Teradata announced its acquisition of Aster Data in March 2011; and IBM announced its own Hadoop offering, InfoSphere BigInsights, in May 2011. What stood out about Greenplum for EMC was its x86-based, scale-out, shared-nothing MPP design, said Mark Sears, architect at the new
organization EMC Greenplum. "Not everyone requires a set-up like that, but more and more customers do," he said.

"In many cases," he continued, "these are companies that are already adept at analysis, at mining customer data for buying trends, for example. In the past, however, they may have had to query a random sample of data relating to 500,000 customers from an overall pool of data of 5 million customers. That opens up the risk that they might miss something important. With a big data approach, it's perfectly possible and relatively cost-effective to query data relating to all 5 million."

Hadoop Ill Understood by the Business

While there's a real buzz around Hadoop among technologists, it's still not widely understood in business circles. In the simplest terms, Hadoop is engineered to run across a large number of low-cost commodity servers and distribute data volumes across that cluster, keeping track of where individual pieces of data reside using the Hadoop Distributed File System (HDFS). Analytic workloads are performed across the cluster, in a massively parallel processing (MPP) model, using tools such as Pig and Hive. Results, meanwhile, are delivered as a unified whole.

Earlier this year, Gartner analyst Marcus Collins mapped out some of the ways he's seeing Hadoop used today: by financial services companies, to discover fraud patterns in years of credit-card transaction records; by mobile telecoms providers, to identify customer churn patterns; and by academic researchers, to identify celestial objects from telescope imagery. "The cost-performance curve of commodity servers and storage is putting seemingly intractable complex analysis on extremely large volumes of data within the budget of an increasingly large number of enterprises," he concluded.
"This technology promises to be a significant competitive advantage to early adopter organizations." Early adopters include daily deal site Groupon, which uses the Cloudera distribution of Hadoop to analyse transaction data generated by over 70 million registered users worldwide, and NYSE Euronext, which operates the
New York stock exchange as well as other equities and derivatives markets across Europe and the US. NYSE Euronext uses EMC Greenplum to manage transaction data that is growing, on average, by as much as 2 TB each day. US bookseller Barnes & Noble, meanwhile, is using Aster Data nCluster (now owned by Teradata) to better understand customer preferences and buying patterns across three sales channels: its retail outlets, its online store and e-reader downloads.

Skills Deficit in Big Data Analytics

While Hadoop's cost advantage might give it widespread appeal, the skills challenge it presents is more daunting, according to Collins. "Big data analytics is an emerging field, and insufficiently skilled people exist within many organizations or the [wider] job market," he said.

Skills-related issues that stand in the way of Hadoop adoption include the technical nature of the MapReduce framework, which requires users to develop Java code, and the lack of institutional knowledge on how to architect the Hadoop infrastructure. Another impediment to adoption is the lack of support for analysis tools used to examine data residing within the Hadoop architecture.

That, however, is starting to change. For a start, established BI tools vendors are adding support for Apache Hadoop: one example is Pentaho, which introduced support in May 2010 and, subsequently, specific support for the EMC Greenplum distribution. Another sign of Hadoop's creep towards the mainstream is support from data integration suppliers, such as Informatica.
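Although production Hadoop jobs are written in Java against the framework's own API, the MapReduce programming model that raises the skills hurdle is conceptually simple. As an illustration of the model only -- not of Hadoop's actual interfaces -- here is a word-count sketch in plain Python that mimics the map, shuffle and reduce phases the framework distributes across a cluster:

```python
from collections import defaultdict

def map_phase(records):
    # Map: each input record independently emits (word, 1) pairs.
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: the framework groups emitted values by key before reducing.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's list of values; here, a simple sum.
    return {key: sum(values) for key, values in groups.items()}

records = ["Hadoop stores data", "Hadoop analyzes data"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'analyzes': 1}
```

On a real cluster each phase runs in parallel on the nodes that hold the data, which is what lets the same three-step logic scale from megabytes to petabytes.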
One of the most common uses of Hadoop to date has been as an engine for the 'transformation' stage of ETL (extraction, transformation, loading) processes: the MapReduce technology lends itself well to the task of preparing data for loading and further analysis, whether in conventional RDBMS-based data warehouses or in those based on HDFS/Hive. In May 2011, Informatica announced that integration was underway between EMC Greenplum and its own Data Integration Platform.
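The 'transformation' stage is record-level work -- exactly the kind of per-record processing MapReduce parallelizes well. As a minimal sketch, here is a transform that normalizes raw feed records before loading; the field layout and date formats are hypothetical, invented purely for illustration:

```python
from datetime import datetime

# Hypothetical raw feed: inconsistent case, stray whitespace, mixed date formats.
raw_records = [
    "  ALICE ,2011-05-03, 19.99 ",
    "bob,03/05/2011,5.50",
]

def transform(record):
    # Normalize one raw record into a clean (customer, iso_date, amount) tuple.
    name, date_str, amount = [field.strip() for field in record.split(",")]
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            date = datetime.strptime(date_str, fmt).date().isoformat()
            break
        except ValueError:
            continue
    else:
        raise ValueError(f"unrecognized date: {date_str}")
    return (name.lower(), date, float(amount))

rows = [transform(r) for r in raw_records]
print(rows)  # [('alice', '2011-05-03', 19.99), ('bob', '2011-05-03', 5.5)]
```

Because each record is transformed independently, a MapReduce job can apply the same function to billions of records in parallel before the cleaned rows are loaded into the warehouse.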
Additionally, there's the emergence of Hadoop appliances, as evidenced by EMC's Greenplum HD (a bundle that combines MapR's version of Hadoop, the Greenplum database and a standard x86-based server), announced in May, and more recently (August 2011), the Dell/Cloudera Solution (which combines Dell PowerEdge C2100 servers and PowerConnect switches with Cloudera's Hadoop distribution and its Cloudera Enterprise management tools).

Finally, Hadoop is proving well suited to deployment in cloud environments, a fact which gives IT teams the chance to experiment with pilot projects on cloud infrastructures. Amazon, for example, offers Amazon Elastic MapReduce as a hosted service running on its Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). The Apache distribution of Hadoop, moreover, now comes with a set of tools designed to make it easier to deploy and run on EC2.

"Running Hadoop on EC2 is particularly appropriate where the data already resides within Amazon S3," said Collins. If the data does not reside on Amazon S3, he said, then it must be transferred: the Amazon Web Services (AWS) pricing model includes charges for network costs, and Amazon Elastic MapReduce pricing is in addition to normal Amazon EC2 and S3 pricing. "Given the volume of data required to run big data analytics, the cost of transferring data and running the analysis should be carefully considered," he cautioned.

The bottom line is that "Hadoop is the future of the EDW and its footprint in companies' core EDW architectures is likely to keep growing throughout this decade," said Kobielus.
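Collins's caution about transfer costs can be made concrete with a back-of-the-envelope estimate. The per-gigabyte price and link speed below are illustrative placeholders, not actual AWS rates:

```python
def estimate_transfer(data_gb, price_per_gb, link_mbps):
    # Cost scales linearly with volume; time depends on the network link.
    cost = data_gb * price_per_gb
    hours = (data_gb * 8 * 1024) / (link_mbps * 3600)  # GB -> megabits
    return cost, hours

# 10 TB over a 100 Mbps link at a placeholder $0.10/GB network rate.
cost, hours = estimate_transfer(data_gb=10240, price_per_gb=0.10, link_mbps=100)
print(f"~${cost:,.0f} and ~{hours:.0f} hours ({hours / 24:.1f} days)")
```

At these assumed numbers the transfer alone takes more than a week of continuous uploading, which is why Collins advises running Hadoop on EC2 mainly when the data already sits in S3.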
'Big Data' Analytics Programs Require Tech Savvy, Business Know-How
By: Rick Sherman, Contributor

Articles and case studies written about "big data" analytics programs proclaim the successes of organizations, typically with a focus on the
technologies used to build the systems. Product information is useful, of course, but it's also critical for enterprises embarking on these projects to be sure they have people with the right skills and experience in place to successfully leverage big data analytics tools.

Knowledge of new or emerging big data technologies, such as Hadoop, is often required, especially if a project involves unstructured or semi-structured data. But even in such cases, Hadoop know-how is only a small portion of the overall staffing requirements; skills are also needed in areas such as business intelligence (BI), data integration and data warehousing. Business and industry expertise is another must-have for a successful big data analytics program, and many organizations also require advanced analytics skills, such as predictive analytics and data mining capabilities.

Data scientist is a new job title that is getting some industry buzz. But it's really a combination of skills from the areas listed above, including predictive modeling, both statistical and business analysis, and data integration development. I've seen some job postings for data scientists that require a statistical degree as well as an industry-specific one. Having both degrees and matching professional experience is great, but what are the chances of finding job candidates who do? Pretty slim!

Taking a more realistic look at what to plan for, I describe below the key skill sets and roles that should be part of big data analytics deployments. Instead of listing specific job titles, I've kept it general; organizations can bundle together the required roles and responsibilities in various ways based on the capabilities and experience levels of their existing workforce and the new employees that they hire.
Business knowledge. In addition to technical skills, companies engaging in big data analytics projects need to involve people with extensive business and industry expertise. That must include knowledge of a company's own business strategy and operations as well as its competitors, current industry conditions and emerging trends, customer demographics and both macro- and microeconomic factors.
Much of the business value derived from big data analytics comes not from textbook key performance indicators but rather from insights and metrics gleaned as part of the analytics process. That process is part science (i.e., statistics) and part art, with users doing what-if analysis to gain actionable information about an organization's business operations. Such findings are only possible with the participation of business managers and workers who have firsthand knowledge of business strategies and issues.

Business analysts. This group also has an important role to play in helping organizations to understand the business ramifications of big data. In addition to doing analytics themselves, business analysts can be tasked with things such as gathering business and data requirements for a project, helping to design dashboards, reports and data visualizations for presenting analytical findings to business users, and assisting in measuring the business value of the analytics program.

BI developers. These people work with the business analysts to build the required dashboards, reports and visualizations for business users. In addition, depending on internal needs and the BI tools that an organization is using, they can enable self-service BI capabilities by preparing the data and the required BI templates for use by business executives and workers.

Predictive model builders. In general, predictive models for analyzing big data need to be custom-built. Deep statistical skills are essential to creating good models; too often, companies underestimate the required skill level and hire people who have used statistical tools, including models developed by others, but don't have knowledge of the mathematical principles underlying predictive models.
Predictive modelers also need some business knowledge and an understanding of how to gather and integrate data into models, although in both cases they can leverage other people's more extensive expertise in creating and refining models. Another key skill for modelers to have is the ability to assess how comprehensive, clean and current the available data is before building models. There often are data gaps, and a modeler has to be able to close them.
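A modeler's first pass at assessing how comprehensive the data is can be as simple as measuring the share of missing values per field before any model is built. A minimal sketch, using hypothetical customer records:

```python
def profile(rows, fields):
    # Report the fraction of records with a missing value for each field --
    # a first-pass completeness check before modeling begins.
    total = len(rows)
    return {f: sum(1 for r in rows if r.get(f) in (None, "")) / total
            for f in fields}

# Hypothetical customer records with gaps in the 'income' field.
rows = [
    {"age": 34, "income": 52000},
    {"age": 41, "income": None},
    {"age": 29, "income": ""},
    {"age": 55, "income": 61000},
]
print(profile(rows, ["age", "income"]))  # {'age': 0.0, 'income': 0.5}
```

A field that is half empty, as 'income' is here, either needs a better source or an explicit strategy for handling the gap before it can drive a predictive model.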
Data architects. A big data analytics project needs someone to design the data architecture and guide its development. Typically, data architects will need to incorporate various data structures into an architecture, along with processes for capturing and storing data and making it available for analysis. This role involves traditional IT skills such as data modeling, data profiling, data quality, data integration and data governance.

Data integration developers. Data integration is important enough to require its own developers. These folks design and develop integration processes to handle the full spectrum of data structures in a big data environment. In doing so, they ideally ensure that integration is done not in silos but as part of a comprehensive data architecture. It's also best to use packaged data integration tools that support multiple forms of data, including structured, unstructured and semi-structured sources. Avoid the temptation to develop custom code to handle extract, transform and load operations on pools of big data; hand coding can increase a project's overall costs -- not to mention the likelihood of an organization being saddled with undocumented data integration processes that can't scale up as data volumes continue to grow.

Technology architects. This role involves designing the underlying IT infrastructure that will support a big data analytics initiative. In addition to understanding the principles of designing traditional data warehousing and BI systems, technology architects need to have expertise in the new technologies that often are used in big data projects.
And if an organization plans to implement open source tools, such as Hadoop, the architects need to understand how to configure, design, develop and deploy such systems.

It can be exciting to tackle a big data analytics project and to acquire new technologies to support the analytics process. But don't be lulled into thinking that tools alone spell big data success. The real key to success is ensuring
that your organization has the skills it needs to effectively manage large and varied data sets as well as the analytics that will be used to derive business value from the available data.

About the Author: Rick Sherman is the founder of Athena IT Solutions, which provides consulting, training and vendor services on business intelligence, data integration and data warehousing. Sherman has written more than 100 articles and spoken at dozens of events and webinars; he also is an adjunct faculty member at Northeastern University's Graduate School of Engineering. He blogs at The Data Doghouse and can be reached at rsherman@athena-solutions.com.