
Simplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum

Data is at the center of digital transformation; using data to drive action is how transformation happens. But data is messy, and it’s everywhere. It’s in the cloud and on-premises. It’s in different types and formats. By the time all this data is moved, consolidated, and cleansed, it can take weeks to build a predictive model.

Even with data lakes, efficiently integrating multi-structured data from different data sources and streams is a major challenge. Enterprises struggle with a stew of data integration tools, application integration middleware, and various data quality and master data management software. How can we simplify this complexity to accelerate and de-risk analytic projects?

The data warehouse, once seen as suitable only for traditional business intelligence applications, has learned new tricks. Join James Curtis from 451 Research and Pivotal’s Bob Glithero for an interactive discussion about the modern analytic data warehouse. In this webinar, we’ll share insights such as:

- Why, after much experimentation with other architectures such as data lakes, the data warehouse has reemerged as the platform for integrated operational analytics

- How consolidating structured and unstructured data (including text, graph, and geospatial data) in one environment makes in-database, highly parallel analytics practical

- How bringing open-source machine learning, graph, and statistical methods to data accelerates analytical projects

- How open-source contributions from a vibrant community of Postgres developers reduce adoption risk and accelerate innovation

We thank you in advance for joining us.

Presenters: Bob Glithero, PMM, Pivotal, and James Curtis, Senior Analyst, 451 Research

Published in: Technology

  1. 1. Simplifying Data and Analytics with Pivotal Greenplum. 451 Research and Pivotal, Inc. May 24, 2018. Version 1.0. © Copyright 2018 Pivotal Software, Inc. All rights Reserved.
  2. 2. Welcome! James Curtis, Sr Analyst, Data Platforms & Analytics (james.curtis@451research.com, @jmscrts, www.451research.com); Bob Glithero, Principal Product Marketing Mgr (linkedin.com/in/glithero, @bglithero, www.pivotal.io); Bharath Sitaraman, Principal Product Manager (linkedin.com/in/bsitaraman, @bharath1028, www.pivotal.io)
  3. 3. Agenda ●  Expanding Analytics with EDW ●  Integrating Data for Analytical Transformation ●  Use Case: Layered Analytics in Cybersecurity ●  Q&A
  4. 4. 451 Research is a leading IT research & advisory company. Founded in 2000. 300+ employees, including over 120 analysts. 2,000+ clients: technology & service providers, corporate advisory, finance, professional services, and IT decision makers. 70,000+ IT professionals, business users and consumers in our research community. Over 52 million data points published each quarter and 4,500+ reports published each year. 3,000+ technology & service providers under coverage. 451 Research and its sister company, Uptime Institute, are the two divisions of The 451 Group. Headquartered in New York City, with offices in London, Boston, San Francisco, Washington DC, Mexico, Costa Rica, Brazil, Spain, UAE, Russia, Taiwan, Singapore and Malaysia. Research & Data; Advisory; Events; Go 2 Market.
  5. 5. Becoming Data Driven, Analytics Driven
  6. 6. Enterprise Data Warehouse: Common Characteristics. Diagram labels: ENTERPRISE APPLICATIONS, DATA WAREHOUSE, DECISION MAKERS, DATA ANALYSTS, IT PROS
  7. 7. Analytic Data Platforms: A Growing Market. 9.1% CAGR 2017-22. Source: 451 Research, Market Monitor, Total Data: Platforms & Analytics, February 2018.
  8. 8. Adapt and Expand Our Field of Vision. Diagram labels: ENTERPRISE APPLICATIONS, DATA WAREHOUSE, DECISION MAKERS, DATA ANALYSTS, IT PROS
  9. 9. Expanded Processing Choices. Diagram labels: ENTERPRISE APPLICATIONS, CLOUD STORAGE, HADOOP, SPARK, AI+ML, DATA WAREHOUSE, DECISION MAKERS, DATA ANALYSTS, IT PROS
  10. 10. Leads to Expansion of Data Sources. Diagram labels: ENTERPRISE APPLICATIONS, CLOUD STORAGE, MOBILE APPS, BOTS, IOT DEVICES AND SENSORS, SOCIAL MEDIA, LOG AND CLICKSTREAM DATA, HADOOP, SPARK, AI+ML, DATA WAREHOUSE, DECISION MAKERS, DATA ANALYSTS, IT PROS
  11. 11. Which Leads to More Advanced Decision-Making Processes. Diagram labels: ENTERPRISE APPLICATIONS, CLOUD STORAGE, MOBILE APPS, BOTS, IOT DEVICES AND SENSORS, SOCIAL MEDIA, LOG AND CLICKSTREAM DATA, HADOOP, SPARK, AI+ML, DATA WAREHOUSE, BUSINESS USERS, DATA-DRIVEN APPLICATIONS, DATA SCIENTISTS, DECISION MAKERS, DATA ANALYSTS, IT PROS, OT USERS
  12. 12. Consider the Environment •  Too many systems to maintain •  Excessive data movement •  Analytics on a portion of data •  Duplicate capabilities •  Low utilization •  Low optimization. Diagram labels: CLOUD STORAGE, HADOOP, SPARK, AI+ML, DATA WAREHOUSE
  13. 13. Consolidate Analytical Frameworks •  Fewer systems to maintain •  Minimize data movement •  Analytic optimization •  Resource efficiency. Diagram labels: CLOUD STORAGE, HADOOP, SPARK, AI+ML, DATA WAREHOUSE
  14. 14. Consolidated Systems Enable In-Database Machine Learning ✔︎ Operate on all of the data, including varied types ✔︎ Algorithms optimized to the architecture ✔︎ No moving of data ✔︎ Leverage the use of SQL
  15. 15. Level Set on Machine Learning ●  The terms ‘algorithm’ and ‘model’ are often used to mean the same thing. They are not. ●  An algorithm is a set of computational instructions, such as Random Forest. ●  A model in that context would be the result of applying a Random Forest to a dataset: its output, which is based on the algorithm.
  16. 16. In-Database Machine Learning Works Best When You... (1) Understand the business problem (and rules) thoroughly. Diagram labels: CLOUD STORAGE, HADOOP, SPARK, AI+ML, DATA WAREHOUSE
  17. 17. In-Database Machine Learning Works Best When You... (1) Understand the business problem (and rules) thoroughly (2) Have a decent amount of data that is consolidated. Diagram labels: CLOUD STORAGE, HADOOP, SPARK, AI+ML, DATA WAREHOUSE
  18. 18. In-Database Machine Learning Works Best When You... (1) Understand the business problem (and rules) thoroughly (2) Have a decent amount of data that is consolidated (3) Algorithms and tools for data analysis and preparation (optimized). Diagram labels: CLOUD STORAGE, HADOOP, SPARK, AI+ML, DATA WAREHOUSE
  19. 19. In-Database Machine Learning Works Best When You... (1) Understand the business problem (and rules) thoroughly (2) Have a decent amount of data that is consolidated (3) Algorithms and tools for data analysis and preparation (optimized) (4) Algorithms for machine learning development (optimized). Diagram labels: CLOUD STORAGE, HADOOP, SPARK, AI+ML, DATA WAREHOUSE
  20. 20. In-Database Machine Learning Works Best When You... (1) Understand the business problem (and rules) thoroughly (2) Have a decent amount of data that is consolidated (3) Algorithms and tools for data analysis and preparation (optimized) (4) Algorithms for machine learning development (optimized) (5) Algorithms and tools to carry out maintenance, validation, updating. Diagram labels: CLOUD STORAGE, HADOOP, SPARK, AI+ML, DATA WAREHOUSE
  21. 21. In-Database Machine Learning Works Best When You... (1) Understand the business problem (and rules) thoroughly (2) Have a decent amount of data that is consolidated (3) Algorithms and tools for data analysis and preparation (optimized) (4) Algorithms for machine learning development (optimized) (5) Algorithms and tools to carry out maintenance, validation, updating (6) Methods for machine learning model deployment. Diagram labels: CLOUD STORAGE, HADOOP, SPARK, AI+ML, DATA WAREHOUSE
  22. 22. Key Takeaways
  23. 23. PIVOTAL Integrating Data for Analytical Transformation
  24. 24. Data is at the center of digital transformation; data-driven action is how transformation happens
  25. 25. How do you get your arms around this?
  26. 26. So How Can We Use Data Effectively? ●  Over 30% of organizations have failed on big data projects ●  Recent research says it takes an average of 52 days to build a predictive model ●  Consolidating data and analytics in fewer environments simplifies modeling and deployment
  27. 27. 1. Converge Analytics and Data ●  Run algorithms in a database, as close to the data as possible ●  Leverage MPP architectures for rapid data science ●  Avoid ETL by moving data only when necessary ●  Integrate structured and unstructured data in one environment, reducing footprint of specialist databases
  28. 28. 2. Remove Friction from Data Science ●  Test as many hypotheses as possible in parallel to find the most relevant features quickly ●  Don’t push/pull data into other environments, train and test, over and over... ●  Instead, develop a process that lets you ○  train a model with consolidated, cleansed data sets in the database ○  deploy the model as a pre-computed object ○  quickly retrain if data patterns change
  29. 29. 3. Choose a Data Platform for Rapid Analytics ●  Some algorithms are iterative, like clustering or graphing ●  Some algorithms can be parallelized, like random forests ●  Sometimes patterns in the underlying data change, so models become obsolete
  30. 30. 4. Start with Standard Statistical Methods and ML ●  A lot of useful data science can be done with standard algorithms ●  Exotic algorithms need more data to train ●  Standard algorithms can be combined into ensembles for greater predictive power. Starting with simpler algorithms and ensembles paves the way for more advanced data science. Source: “A Few Useful Things to Know About Machine Learning,” Pedro Domingos, Communications of the ACM, October 2012
  31. 31. But Don’t Just Take Our Word for It... “As we built the ML model, we were surprised to learn that none of the most hyped data science tools — such as deep learning, AutoML, and ‘AI that creates AI’ — were needed to make it work.”
  32. 32. Pivotal Solutions for Data and Analytics. Pivotal Greenplum: Multi-Cloud, MPP data platform for complex analytics with diverse data locality and data types. Pivotal GemFire: Fast, transactional in-memory grid for rapid data refresh. Pivotal Cloud Cache: On-demand in-memory caching for cloud native apps. Pivotal Cloud Foundry: Proven solution for operationalization of analytics and software-led, digital transformation. Pivotal Data Science: World-class Data Science consulting to drive more insights from data for Data-Driven Applications. Apache MADlib: Distributed, in-database analytical library for large-scale data sets. Complete portfolio; Multi-Cloud and on premises; Based on open source; Flexible licensing; Advanced data services.
  33. 33. Consolidating Diverse Data Enables New Use Cases. Native Graph: Relationship Intelligence. Greenplum GPText: Fast Search and Semantic Intelligence. Greenplum PostGIS: Location Intelligence. Greenplum: Integrated, Cleansed Data. New use cases for locations, flows, connections, relationships, and intent
  34. 34. Text Analytics with GPText. Extracts content, structure from many binary formats. Fast heterogeneous document indexing, search, and retrieval. Massive parallelism for ●  Topic modeling ●  Named entity recognition ●  Term frequency ●  Stemming ●  Topic graph ●  Topic cloud ●  NLP
  35. 35. PIVOTAL Layered Analytics for Security
  36. 36. Insider Threats Increasingly Evade Network-Level Visibility. Attacks go unnoticed for long periods. Source: Verizon, 2017, n = 77
  37. 37. Layering Analytic Techniques for a More Complete Picture, from general to specific. Understand activity: Network-level intelligence (firewalls, IDS, SIEMs). Understand behavior: Understanding user/entity behavior (graphing, clustering, predictive analytics). Understand intent: Semantic understanding (e.g., text analytics, NLP)
  38. 38. Understanding Activity - Network Intelligence. Engineering-level view misses higher-level activity ●  SIEM/IDS useful if activity matches a signature or rule ●  Small-ish data sets - APTs unfold over months or years ●  Inflexible schema - difficult to extend with user-level attributes ●  User is not the same as device, IP address. SIEM functions: Log Collection, Log Analysis, Event Correlation, Log Forensics, Object Access Auditing, Alerts, Reports, Log Monitoring, Log Retention, File Integrity Monitoring
  39. 39. Advanced Analytics Reveal Hidden Patterns. Reveal latent/invisible patterns that stand out from normal behavior ●  Reconnaissance ●  Privilege escalation attempts ●  Access attempts ●  Unusual data flows or exfiltration attempts
  40. 40. Predictive Analytics for Understanding Behavior. Chaining models increases predictive power, decreases false positives. Training data: ●  Kerberos authentication events (source, destination, account, type, success/failure, etc.) ●  10K users, 13K nodes ●  >110M events. Lateral Movement Ensemble Model: •  Regression analysis •  Constrained diameter authentication graph •  Robust rank aggregation. Model features: ●  # of distinct destinations logged in to ●  # of distinct sources logged in from ●  # of distinct destination user accounts ●  # of distinct processes started by user. Outcomes: Signals of credential takeover, running scripts, and other behavioral anomalies. https://content.pivotal.io/blog/insider-threat-detection-detecting-variance-in-user-behavior-using-an-ensemble-approach
  41. 41. Revealing the Needle in the Haystack. Ensemble of methods reveals hidden patterns that signal problem behavior ●  Surge in access attempts at regular intervals indicates possible background script ●  Regular surges in logins from unusual user accounts indicate possible account takeovers ●  Graph reveals two access attempts to a sensitive server indirectly via other servers
  42. 42. Understanding Intent with Semantic Intelligence. Semantic intelligence can clarify ambiguous signals from other analytics or security appliances ●  Scan for interesting words, words in proximity, variations ●  Scan and index content - “Is this document leaving the network similar to other known sensitive documents?” ●  Scan document to suggest possible classification tags. Examples: “sorry, i sent this file by mistake” ✔ Benign, no further action. “please don’t share this file outside the organization” ! Investigate
  43. 43. Final Thoughts PIVOTAL
  44. 44. Digital Transformation is Real. T-Mobile goes from 7 months and 72 steps to update software, to same-day deployments. Liberty Mutual builds and deploys an MVP in one month and delivers a revenue-generating version just months later. Comcast supports over 1500 developers with an operator team of 4 people. The Home Depot ships to production 1,500 times a month, and 17,000 times a month to all environments. Leading companies trust Pivotal as a transformation partner
  45. 45. 45
  46. 46. Data is the Key to Transformation ●  Consolidating data simplifies modeling and deployment ●  Executing massively parallel analytics in the database speeds complex use cases ●  Pivotal Greenplum integrates diverse data with machine learning at scale for faster value with less risk
  47. 47. Start Your Data Transformation Journey Today! Pivotal Greenplum: pivotal.io/pivotal-greenplum. Pivotal Data Science: pivotal.io/data-science. Apache MADlib: madlib.apache.org. Greenplum Database Channel
  48. 48. Data Tells the Story © Copyright 2018 Pivotal Software, Inc. All rights Reserved.
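
Slide 15 separates an algorithm from the model it produces. A minimal sketch of that distinction follows, in plain Python. It swaps the slide's Random Forest for a one-threshold decision stump to stay short; `fit_stump`, `StumpModel`, and the toy data are hypothetical names for illustration and come from no Pivotal or MADlib API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StumpModel:
    """The model: parameters learned from one particular dataset."""
    threshold: float
    label_below: int
    label_above: int

    def predict(self, x: float) -> int:
        return self.label_below if x <= self.threshold else self.label_above


def fit_stump(data: List[Tuple[float, int]]) -> StumpModel:
    """The algorithm: a fixed procedure that turns a labeled dataset into a model."""
    if not data:
        raise ValueError("need at least one (value, label) pair")
    best_errors = None
    best_model = None
    # Try every observed value as a threshold, in both label orientations,
    # and keep the first candidate with the fewest training errors.
    for threshold in sorted({x for x, _ in data}):
        for below, above in ((0, 1), (1, 0)):
            candidate = StumpModel(threshold, below, above)
            errors = sum(candidate.predict(x) != y for x, y in data)
            if best_errors is None or errors < best_errors:
                best_errors, best_model = errors, candidate
    return best_model


# Same algorithm, two datasets, two different models.
dataset_a = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
dataset_b = [(1.0, 0), (5.0, 0), (6.0, 1), (9.0, 1)]
model_a = fit_stump(dataset_a)
model_b = fit_stump(dataset_b)
```

Here `fit_stump` plays the role of the algorithm and never changes, while `model_a` and `model_b` differ because they were fitted to different data.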
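
Slide 30 notes that standard algorithms can be combined into ensembles. The sketch below illustrates the idea with simple majority voting over three hand-written rules. This is an assumption made for illustration and is not the robust rank aggregation named on slide 40; the rule names and thresholds are hypothetical.

```python
from collections import Counter
from typing import Callable, Mapping, Sequence

Classifier = Callable[[Mapping[str, int]], str]


def many_destinations(features: Mapping[str, int]) -> str:
    # Hypothetical rule: logging in to many hosts looks unusual.
    return "suspicious" if features["destination"] > 5 else "normal"


def many_sources(features: Mapping[str, int]) -> str:
    # Hypothetical rule: logging in from many hosts looks unusual.
    return "suspicious" if features["source"] > 3 else "normal"


def many_processes(features: Mapping[str, int]) -> str:
    # Hypothetical rule: starting many distinct processes looks unusual.
    return "suspicious" if features["process"] > 10 else "normal"


def majority_vote(classifiers: Sequence[Classifier],
                  features: Mapping[str, int]) -> str:
    """Return the label chosen by most classifiers (use an odd count to avoid ties)."""
    votes = Counter(classifier(features) for classifier in classifiers)
    return votes.most_common(1)[0][0]


rules = [many_destinations, many_sources, many_processes]
verdict_a = majority_vote(rules, {"destination": 8, "source": 1, "process": 12})
verdict_b = majority_vote(rules, {"destination": 2, "source": 1, "process": 20})
```

A single noisy rule (the process count for the second user) is outvoted by the other two, which is the false-positive reduction that slide 40 attributes to chaining models.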
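
The model features on slide 40 are all per-user distinct counts. The sketch below shows one way to derive them from event records in plain Python. The field names (`user`, `source`, `destination`, `dest_account`, `process`) and the sample events are hypothetical, since the deck does not give the actual schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping

# Hypothetical field names, one per feature listed on slide 40.
FEATURE_FIELDS = ("destination", "source", "dest_account", "process")


def distinct_counts(events: Iterable[Mapping[str, str]]) -> Dict[str, Dict[str, int]]:
    """Count distinct values of each feature field per user."""
    seen = defaultdict(lambda: {name: set() for name in FEATURE_FIELDS})
    for event in events:
        per_user = seen[event["user"]]
        for name in FEATURE_FIELDS:
            value = event.get(name)  # not every event type carries every field
            if value is not None:
                per_user[name].add(value)
    return {
        user: {name: len(values) for name, values in sets.items()}
        for user, sets in seen.items()
    }


events = [
    {"user": "alice", "source": "ws-01", "destination": "db-01", "dest_account": "alice"},
    {"user": "alice", "source": "ws-01", "destination": "db-01", "dest_account": "alice"},
    {"user": "bob", "source": "ws-02", "destination": "db-01", "dest_account": "svc_backup"},
    {"user": "bob", "source": "ws-07", "destination": "hr-02", "dest_account": "admin"},
    {"user": "bob", "process": "powershell.exe"},
]
features = distinct_counts(events)
```

Inside a database the same features would normally be a `GROUP BY` on the user with `COUNT(DISTINCT ...)` per column, which avoids moving the events out of the database, in line with slide 27.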
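
Slide 42 mentions scanning for interesting words, words in proximity, and variations. A rough sketch of a proximity check follows, run against the two example messages from that slide. It is plain Python, not the GPText API; the prefix match is a crude stand-in for stemming, and the function name, stems, and window size are assumptions.

```python
import re
from typing import List


def stems_near(text: str, stem_a: str, stem_b: str, window: int = 3) -> bool:
    """True if a word starting with stem_a occurs within `window` tokens of one starting with stem_b."""
    tokens: List[str] = re.findall(r"[a-z]+", text.lower())
    positions_a = [i for i, token in enumerate(tokens) if token.startswith(stem_a)]
    positions_b = [i for i, token in enumerate(tokens) if token.startswith(stem_b)]
    return any(abs(i - j) <= window for i in positions_a for j in positions_b)


benign = "sorry, i sent this file by mistake"
risky = "please don’t share this file outside the organization"

flag_benign = stems_near(benign, "shar", "outside")
flag_risky = stems_near(risky, "shar", "outside")
flag_variant = stems_near("Sharing the file outside is not allowed", "shar", "outside")
```

A hit like `flag_risky` would only route a message to investigation. As the slide suggests, it is a signal to combine with other analytics and not a verdict by itself.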
