
Is the traditional data warehouse dead?



With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that? No! In the presentation I’ll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I’ll put it all together by showing common big data architectures.

Is the traditional data warehouse dead?

  1. Is the traditional data warehouse dead? James Serra Big Data Evangelist Microsoft JamesSerra3@gmail.com (Data Lake and Data Warehouse – the best of both worlds)
  2. About Me  Microsoft, Big Data Evangelist  In IT for 30 years, worked on many BI and DW projects  Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer  Been perm employee, contractor, consultant, business owner  Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference  Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions  Blog at JamesSerra.com  Former SQL Server MVP  Author of book “Reporting with Microsoft SQL Server 2012”
  3. Agenda  Data Warehouse  Data Lake  The best of both worlds  Federated querying  Patterns
  4. Considering Data Types. Unstructured: audio, video, images; meaningless without adding some structure. Semi-structured: JSON, XML, sensor data, social media, device data, web logs; flexible data model structure. Structured: CSV, columnar storage (Parquet, ORC); strict data model structure. Relational data and non-relational data are data models, describing how data is organized; structured, semi-structured, and unstructured data are data types.
  5. Two approaches to getting value out of data: Top-Down + Bottoms-Up. Top-down (theory, hypothesis, observation, confirmation): descriptive analytics (what happened?) and diagnostic analytics (why did it happen?). Bottoms-up (observation, pattern, theory, hypothesis): predictive analytics (what will happen?) and prescriptive analytics (how can we make it happen?).
  6. Of course you still need a data warehouse. A data warehouse is where you store data from multiple data sources to be used for historical and trend analysis reporting. It acts as a central repository for many subject areas and contains the "single version of truth". Reasons for a data warehouse:  Reduce stress on production system  Optimized for read access, sequential disk scans  Integrate many sources of data  Keep historical records (no need to save hardcopy reports)  Restructure/rename tables and fields, model data  Protect against source system upgrades  Use Master Data Management, including hierarchies  No IT involvement needed for users to create reports  Improve data quality and plug holes in source systems  One version of the truth  Easy to create BI solutions on top of it (i.e. SSAS cubes)
  7. Traditional Data Warehousing Uses A Top-Down Approach: understand corporate strategy, gather requirements (business requirements, technical requirements), then design (setup infrastructure, dimension modelling, ETL design, reporting & analytics design), then implement the data warehouse (physical design, ETL development, reporting & analytics development, install and tune) against the data sources.
  8. Traditional business analytics process (an ETL pipeline: dedicated ETL tools such as SSIS, a defined schema, queries and results over relational LOB applications): 1. Start with end-user requirements to identify desired reports and analysis 2. Define corresponding database schema and queries 3. Identify the required data sources 4. Create an Extract-Transform-Load (ETL) pipeline to extract required data (curation) and transform it to the target schema ("schema-on-write") 5. Create reports and analyze the data. All data not immediately required is discarded or archived.
  9. Harness the growing and changing nature of data (structured, unstructured, streaming): need to collect any data. The challenge is combining transactional data stored in relational databases with less structured data. Big Data = All Data. Get the right information to the right people at the right time in the right format.
  10. The three V’s
  11. New big data thinking: all data has value. Gather data from all sources, store indefinitely, analyze, see results, iterate. All data has potential value, which leads to data hoarding. No defined schema: data is stored in native format. Schema is imposed and transformations are done at query time (schema-on-read); apps and users interpret the data as they see fit.
  12. The “data lake” Uses A Bottoms-Up Approach: ingest all data (devices and other sources) regardless of requirements; store all data in native format without schema definition; do analysis using analytic engines like Hadoop (interactive queries, batch queries, machine learning, real-time analytics, feeding the data warehouse).
  13. Data Analysis Paradigm Shift. OLD WAY: Structure -> Ingest -> Analyze. NEW WAY: Ingest -> Analyze -> Structure
  14. Exactly what is a data lake? A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed. • Inexpensively store unlimited data • Collect all data “just in case” • Store data with no modeling – “schema on read” • Complements EDW • Frees up expensive EDW resources • Quick user access to data • ETL Hadoop tools • Easily scalable • Place to back up data to • Place to move older data
  15. Needs data governance so your data lake does not turn into a data swamp!
  16. The real cost of Hadoop: https://www.scribd.com/document/172491475/WinterCorp-Report-Big-Data-What-Does-It-Really-Cost/
  17. A data lake is just a glorified file folder with data files in it – how many end-users can accurately create reports from it?
  18. • Query performance not as good as a relational database • Complex query support not good due to lack of a query optimizer, in-database operators, advanced memory management, concurrency, dynamic workload management and robust indexing • Concurrency limitations • No concept of “hot” and “cold” data storage with different levels of performance to reduce cost • Not a DBMS, so lacking features such as update/delete of data, referential integrity, statistics, ACID compliance, data security • File based, so no granular security definition at the column level • No metadata stored in HDFS, so another tool is required, adding complexity and slowing performance • Finding expertise in Hadoop is very difficult • Super complex, with lots of integration with multiple technologies to make it work • Many tools/technologies/versions/vendors (fragmentation), no standards, and it is difficult to make it a corporate standard • Lack of master data management tools for Hadoop • Requires end-users to learn new reporting tools and Hadoop technologies to query the data • Pace of change is so quick many Hadoop technologies become obsolete, adding risk • Lack of cost savings: cloud consumption, support, licenses, training, and migration costs • Need a conversion process to convert data to a relational format if a reporting tool requires it • Some reporting tools don’t work against Hadoop
  19. Current state of a data warehouse: Traditional Approaches. LOB data sources (CRM, ERP, OLTP) -> ETL -> data warehouse (star schemas, views, other read-optimized structures) -> BI and analytics (emailed, centrally stored Excel reports and dashboards), with monitoring and telemetry throughout. Well manicured, often relational sources; known and expected data volume and formats; little to no change. Complex, rigid transformations; required extensive monitoring; transformed historical data into read structures. Flat, canned or multi-dimensional access to historical data; many reports, multiple versions of the truth; 24 to 48h delay.
  20. Current state of a data warehouse: Traditional Approaches under pressure. Increases in the variety of data sources, in data volume, and in types of data put pressure on the ingestion engine. Complex, rigid transformations can no longer keep pace and monitoring is abandoned: delay in data, inability to transform volumes or react to new sources; ETL must be repaired, adjusted and redesigned. Reports become invalid or unusable, the delay in preserved reports increases, and users begin to “innovate” to relieve starvation. (Increasing data volume, non-relational data, increase in time, stale reporting.)
  21. Data Lake Transformation (ELT not ETL): New Approaches. All data sources are considered (LOB data sources such as CRM, ERP, OLTP; non-relational data; future data sources); leverages the power of on-prem technologies and the cloud for storage and capture. Extract and load, no/minimal transform: native formats, streaming data, and big data are stored in near-native format in the data lake; orchestration and streaming data accommodation become possible. Refineries transform data on read (data refinery process), transforming relevant data into data sets and producing curated data sets to integrate with traditional warehouses (star schemas, views, other read-optimized structures). Users discover published data sets/services using familiar tools; BI and analytics discover and consume predictive analytics, data sets and other reports.
  22. Data Lake + Data Warehouse: Better Together. From data sources: descriptive analytics (what happened?), diagnostic analytics (why did it happen?), predictive analytics (what will happen?), prescriptive analytics (how can we make it happen?).
  23. Modern Data Warehouse • Ultimate goal • Supports future data needs • Data harmonized and analyzed in the data lake or moved to the EDW for more quality and performance
  24. Data Lake vs. Data Warehouse: schema-on-read vs. schema-on-write; physical collection of uncurated data vs. data of common meaning; System of Insight (unknown data for experimentation / data discovery) vs. System of Record (well-understood data for operational reporting); any type of data vs. a limited set of data types (i.e. relational); skills are limited vs. skills mostly available; all workloads (batch, interactive, streaming, machine learning) vs. optimized for interactive querying; complementary to the DW vs. can be sourced from the data lake.
  25. Data Warehouse Serving, Security & Compliance • Business people • Low latency • Complex joins • Interactive ad-hoc query • High number of users • Additional security • Large support for tools • Dashboards • Easily create reports (self-service BI) • Known questions
  26. Use cases using Hadoop and a DW in combination: bringing islands of Hadoop data together; archiving data warehouse data to Hadoop (move) (Hadoop as cold storage); exporting relational data to Hadoop (copy) (Hadoop as backup/DR, analysis, cloud use); importing Hadoop data into the data warehouse (copy) (Hadoop as staging area, sandbox, data lake).
  27. Reasons you still need a cube/OLAP • Semantic layer • Handle many concurrent users • Aggregating data for performance • Multidimensional analysis • No joins or relationships • Hierarchies, KPIs • Row-level security • Advanced time-calculations • Slowly Changing Dimensions (SCD)
  28. Federated Querying
  29. Federated Querying. Other names: data virtualization, logical data warehouse, data federation, virtual database, and decentralized data warehouse. A model that allows a single query to retrieve and combine data as it sits from multiple data sources, so as to not need to use ETL or learn more than one retrieval technology.
  30. SQL Server and PolyBase: query relational and non-relational data with T-SQL.
  31. Advanced Analytics. Sources: social, LOB, graph, IoT, image, CRM. INGEST (data orchestration and monitoring: Azure Data Factory) -> STORE (big data store: Azure Blob Storage, Azure Data Lake) -> PREP & TRAIN (Hadoop/Spark and machine learning: Azure Databricks, Azure HDInsight, Azure Machine Learning, Machine Learning Server) -> MODEL & SERVE (data warehouse, cloud bursting, BI + reporting: Azure SQL Data Warehouse, Azure Analysis Services).
  32. INGEST (Azure Data Factory) -> STORE (Azure Data Lake Store) -> PREP & TRAIN (Azure Databricks, Azure HDInsight, Data Lake Analytics) -> MODEL & SERVE (Azure SQL Data Warehouse via PolyBase, Azure Analysis Services). Inputs: logs, files and media (unstructured); business/custom apps (structured). Output: analytical dashboards.
  33. INGEST -> STORE (Azure Data Lake Store) -> PREP & TRAIN (Hortonworks) -> MODEL & SERVE (Azure SQL Data Warehouse via PolyBase, Azure SQL Database, Tableau Server). Inputs: logs, files and media (unstructured); business/custom apps (structured). Outputs: analytical dashboards, operational reports, ad-hoc query.
  34. https://aka.ms/ADAG
  35. Q & A? James Serra, Big Data Evangelist. Email me at: JamesSerra3@gmail.com Follow me at: @JamesSerra Link to me at: www.linkedin.com/in/JamesSerra Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)
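The paradigm shift on slide 13 (structure before ingest versus structure at query time) can be sketched in plain Python. This is a toy illustration of the two approaches, not any particular engine's API; the record shapes and names are made up:

```python
import json

# Schema-on-write (OLD WAY): validate and shape each record against a
# fixed schema before storing; nonconforming records fail at load time.
SCHEMA = ("customer_id", "amount")

def load_schema_on_write(raw_records, table):
    for rec in raw_records:
        row = {k: rec[k] for k in SCHEMA}      # KeyError here = rejected at load
        row["amount"] = float(row["amount"])   # transform before write
        table.append(row)

# Schema-on-read (NEW WAY): land the raw payloads untouched, then impose
# whatever structure a given query needs at read time.
def query_schema_on_read(raw_store, field, default=0.0):
    return sum(float(json.loads(line).get(field, default)) for line in raw_store)

table = []
load_schema_on_write([{"customer_id": 1, "amount": "9.99"}], table)

lake = ['{"customer_id": 1, "amount": "9.99"}',
        '{"customer_id": 2}']                  # missing field is fine until read
total = query_schema_on_read(lake, "amount")
```

The trade-off the deck describes falls out directly: the warehouse path pays the modeling cost once up front, while the lake path defers it to every query (and every query author).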

Editor's Notes

  • With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that?  No! In the presentation I'll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I'll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I'll put it all together by showing common big data architectures.

    http://www.jamesserra.com/archive/2017/12/is-the-traditional-data-warehouse-dead/

    https://www.slideshare.net/jamserra/big-data-architectures-and-the-data-lake

    https://www.slideshare.net/jamserra/differentiate-big-data-vs-data-warehouse-use-cases-for-a-cloud-solution
  • Fluff, but point is I bring real work experience to the session
  • No
    Pay the piper now or later
    The real question is…
    Dump files into data lake and tell user to go for it
  • Relational databases (RDBMS) generally work with structured data. Non-relational databases (NoSQL) work with semi-structured data

    Relational data and non-relational data are data models, describing how data is organized. Structured, semi-structured, and unstructured data are data types
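The data-model versus data-type distinction in this note can be made concrete with a stdlib-only sketch (the sample payloads are invented for illustration):

```python
import csv, io, json

# Structured: CSV has a strict, positional schema; every row must fit it.
csv_data = "id,name\n1,Alice\n2,Bob\n"
rows = list(csv.DictReader(io.StringIO(csv_data)))

# Semi-structured: JSON carries its own flexible structure; records may
# differ in shape, and nesting is allowed.
events = [json.loads(s) for s in (
    '{"user": "alice", "device": {"os": "ios"}}',
    '{"user": "bob", "clicks": 3}',            # different fields, still valid
)]
```

Unstructured data (audio, video, images) would arrive as opaque bytes, with no fields to parse at all until some structure is added.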
  • Top down starts with descriptive analytics and progresses to prescriptive analytics. Know the questions to ask. Lots of upfront work to get data to where you can use it
    Bottoms up starts with predictive analytics. Don’t know the questions to ask. Little work needs to be done to start using data


    There are two approaches to doing information management for analytics:
    Top-down (deductive approach). This is where analytics is done starting with a clear understanding of corporate strategy where theories and hypothesis are made up front. The right data model is then designed and implemented prior to any data collection. Oftentimes, the top-down approach is good for descriptive and diagnostic analytics. What happened in the past and why did it happen?
    Bottom-up (inductive approach). This is the approach where data is collected up front before any theories and hypothesis are made. All data is kept so that patterns and conclusions can be derived from the data itself. This type of analysis allows for more advanced analytics such as doing predictive or prescriptive analytics: what will happen and/or how can we make it happen?

    In Gartner’s 2013 study, “Big Data Business Benefits Are Hampered by ‘Culture Clash’”, they make the argument that both approaches are needed for innovation to be successful. Oftentimes what happens in the bottom-up approach becomes part of the top-down approach.
    .
  • One version of truth story: different departments using different financial formulas to help bonus

    This leads to reasons to use BI. This is used to convince your boss of need for DW

    Note that you still want to do some reporting off of source system (i.e. current inventory counts).

    It’s important to know upfront if data warehouse needs to be updated in real-time or very frequently as that is a major architectural decision

    JD Edwards has tables names like T117
  • The data warehouse leverages the top-down approach, where there is a well-architected information store and enterprise-wide BI solution. Building a data warehouse follows the top-down approach, where the company’s corporate strategy is defined first. This is followed by gathering of business and technical requirements for the warehouse. The data warehouse is then implemented by dimension modelling and ETL design, followed by the actual development of the warehouse. This is all done prior to any data being collected. It utilizes a rigorous and formalized methodology, because a true enterprise data warehouse supports many users/applications within an organization to make better decisions.
  • Key Points:
    Businesses can use new data streams to gain a competitive advantage.
    Microsoft is uniquely equipped to help you manage the growing volume and variety of data: structured, unstructured, and streaming.

    Talk Track:
    Does it not seem like every day there is a new kind of data that we need to understand?
    New data types continue to expand—we need to be prepared to collect that data so that the organization can then go do something with it.
    Structured data, the type of data we have been working with for years, continues to accelerate. Think how many transactions are occurring across your business.
    Unstructured data, the typical source of all our big data, takes many forms and originates from various places across the web including social.
    Streaming data is the data at the heart of the Internet of Things revolution. Just think about how many things in your organization are smart or instrumented and generating data every second.
    All of this means that data volumes are growing and bringing new capacity challenges. You are also dealing with an enormous opportunity, taking all of this data and putting it to work. In order to take advantage of all this data, you first need a platform that enables you to collect any data—no matter the size or type. The Microsoft data platform is uniquely complete and can help you collect any data using a flexible approach:
    Collecting data on-premises with SQL Server
    SQL Server can help you collect and manage structured, unstructured, and streaming data to power all your workloads: OLTP, BI, and Data Warehousing
    With new in-memory capabilities that are built into SQL Server 2014, you get the benefit of breakthrough speed with your existing hardware and without having to rewrite your apps.
    If you’ve been considering the cloud, SQL Server provides an on-ramp to help you get started. Using the wizards built into SQL Server Management Studio, extending to the cloud by combining SQL and Microsoft Azure is simple.
    Capture new data types using the power and flexibility of the Microsoft Azure Cloud
    Azure is well equipped to provide the flexibility you need to collect and manage any data in the cloud in a way that meets the needs of your business.
    Big data in Azure: HDInsight: an Apache Hadoop-based analytics solution that allows cluster deployment in minutes, scale up or down as needed, and insights through familiar BI tools.
    SQL Databases: managed relational SQL Database-as-a-service that offers business-ready capabilities built on SQL Server technology.
    Blobs: a cloud storage solution offering the simplest way to store large amounts of unstructured text or binary data, such as video, audio, and images.
    Tables: a NoSQL key/value storage solution that provides simple access to data at a lower cost for applications that do not need robust querying capabilities.
    Intelligent Systems Service: cloud service that helps enterprises embrace the Internet of Things by securely connecting, managing, and capturing machine-generated data from a variety of sensors and devices to drive improvements in operations and tap into new business opportunities.
    Machine Learning: if you’re looking to anticipate business challenges or opportunities, or perhaps expand your data practice into data science, Azure’s new Machine Learning service—cloud-based predictive analytics— can help. ML Studio is a fully-managed cloud service that enables data scientists and developers to efficiently embed predictive analytics into their applications, helping organizations use massive data sets and bring all the benefits of the cloud to machine learning.
    Document DB: a fully managed, highly scalable, NoSQL document database service
    Azure Stream Analytics: real-time event processing engine that helps uncover insights from devices, sensors, infrastructure, applications, and data
    Azure Data Factory: enables information production by orchestrating and managing diverse data
    Azure Event Hubs: a scalable service for collecting data from millions of “things” in seconds
    Microsoft Analytics Platform System:
    In the past, to provide users with reliable, trustworthy information, enterprises gathered relational and transactional data in a single data warehouse.
    But this traditional data warehouse is under pressure, hitting limits amidst massive change.
    Data volumes are projected to grow tenfold over the next five years. End users want real-time responses and insights.
    They want to use non-relational data, which now constitutes 85 percent of data growth. They want access to “cloud-born” data, data that was created from growing cloud IT investments.
    Your enterprise can only cope with these shifts with a modern data warehouse—the Microsoft Analytics Platform System is the answer.
    The Analytics Platform System brings Microsoft’s massively parallel processing (MPP) data warehouse technology—the SQL Server Parallel Data Warehouse (PDW), together with HDInsight, Microsoft’s 100 percent Apache Hadoop distribution—and delivers it as a turnkey appliance.
    Now you can collect relational and non-relational data in one appliance.
    You can have seamless integration of the relational data warehouse and Hadoop with PolyBase.
     
    All of these options give you the flexibility to get the most out of your existing data capture investments while providing a path to a more efficient and optimized data environment that is ready to support new data types.
  • All data has immediate or potential value
    This leads to data hoarding—all data is stored indefinitely
    With an unknown future, there is no defined schema. Data is prepared and stored in native format; No upfront transformation or aggregation
    Schema is imposed and transformations are done at query time (schema-on-read). Applications and users interpret the data as they see fit.
  • Also called bit bucket, staging area, landing zone or enterprise data hub (Cloudera)

    http://www.jamesserra.com/archive/2014/05/hadoop-and-data-warehouses/

    http://www.jamesserra.com/archive/2014/12/the-modern-data-warehouse/

    http://adtmag.com/articles/2014/07/28/gartner-warns-on-data-lakes.aspx

    http://intellyx.com/2015/01/30/make-sure-your-data-lake-is-both-just-in-case-and-just-in-time/

    http://www.blue-granite.com/blog/bid/402596/Top-Five-Differences-between-Data-Lakes-and-Data-Warehouses

    http://www.martinsights.com/?p=1088

    http://data-informed.com/hadoop-vs-data-warehouse-comparing-apples-oranges/

    http://www.martinsights.com/?p=1082

    http://www.martinsights.com/?p=1094

    http://www.martinsights.com/?p=1102
  • Inexpensively store unlimited data
    Collect all data “just in case”
    Easy integration of differently-structured data
    Store data with no modeling – “Schema on read”
    Complements enterprise data warehouse (EDW)
    Frees up expensive EDW resources, especially for refining data
    Hadoop cluster offers faster ETL processing over SMP solutions
    Quick user access to data
    Data exploration to see if data valuable before writing ETL and schema for relational database
    Allows use of Hadoop tools such as ETL and extreme analytics
    Place to land IoT streaming data
    On-line archive or backup for data warehouse data
    Easily scalable
    With Hadoop, high availability built in
    Allows for data to be used many times for different analytic needs and use cases
    Low-cost storage for raw data saving space on the EDW
  • https://www.sqlchick.com/entries/2017/12/30/zones-in-a-data-lake
    https://www.sqlchick.com/entries/2016/7/31/data-lake-use-cases-and-planning

    Question: Do you see many companies building data lakes?

    Raw: Raw events are stored for historical reference. Also called staging layer or landing area
    Cleansed: Raw events are transformed (cleaned and mastered) into directly consumable data sets. The aim is to standardize the way files are stored in terms of encoding, format, data types and content (i.e. strings). Also called the conformed layer
    Application: Business logic is applied to the cleansed data to produce data ready to be consumed by applications (i.e. DW application, advanced analysis process, etc). This is also called by a lot of other names: workspace, trusted, gold, secure, production ready, governed, presentation
    Sandbox: Optional layer to be used to “play” in.  Also called exploration layer or data science workspace
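The zone progression above is often just a directory convention on the lake's file system. A minimal sketch, where the file name and the "cleansing" rule (lowercase keys, one JSON object per file) are illustrative rather than from the deck:

```python
import json, tempfile
from pathlib import Path

# Hypothetical zone layout mirroring the note above.
ZONES = ("raw", "cleansed", "application", "sandbox")

lake = Path(tempfile.mkdtemp()) / "datalake"
for zone in ZONES:
    (lake / zone).mkdir(parents=True)

# Land a raw event untouched in the raw (staging/landing) zone.
raw_file = lake / "raw" / "events_2018-01-01.json"
raw_file.write_text('{"User": "Alice", "Amount": "9.99"}')

# Promote a cleansed copy: normalize the keys while keeping the raw
# original for historical reference, as the conformed layer intends.
record = {k.lower(): v for k, v in json.loads(raw_file.read_text()).items()}
(lake / "cleansed" / raw_file.name).write_text(json.dumps(record))
```

Governance then becomes largely a matter of who can read and write each zone, which is what keeps the lake from turning into the swamp warned about earlier.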
  • Question: Do you see many companies building data lakes?
  • I’m not saying your data warehouse can’t consist of just a Hadoop data lake, as it has been done at Google, the NY Times, eBay, Twitter, and Yahoo.  But are you as big as them?  Do you have their resources?  Do you generate data like them?  Do you want a solution that only 1% of the workforce has the skillset for?  Is your IT department radical or is it conservative?
  • Does a data scientist or analyst think locally or globally? Do they create a model that supports just their use case, or do they think more broadly about how this data set can support other use cases? It may be best to continue to let IT model and refine the data inside a relational data warehouse so that it is suitable for different types of business users.
  • As far as reporting goes, whether to have users report off of a data lake or via a relational database and/or a cube is a balance. On one side, you give users data quickly and have them do the work to join, clean and master it (getting IT out of the way). On the other, IT makes multiple copies of the data and cleans, joins and masters it, which makes reporting easier for users but introduces a delay while IT does all this. The risks in the first case are users repeating the clean/join/master process, doing it wrong, and getting different answers to the same question, as well as slower performance because the data is not laid out efficiently. Most solutions incorporate both: power users or data scientists access the data quickly via the data lake, while all other users access the data in a relational database or cube, making self-service BI a reality (most users would not have the skills to access data in a data lake properly or at all, so a cube is appropriate, as it provides a semantic layer among other advantages to make report building very easy – see Why use a SSAS cube?).
  • http://www.jamesserra.com/archive/2014/05/hadoop-and-data-warehouses/

    Not saying your EDW can’t consist of a Hadoop data lake, as it has been done at Google, the NY Times, eBay, Twitter, Yahoo. But are you as big as them? Do you have their resources? Do you generate data like them? Do you want a solution that only 1% of the workforce has the skillset? Radical vs conservative

    http://www.wintercorp.com/tcod-report/
  • Why move relational data to data lake? Offload processing to refine data to free-up EDW, use low-cost storage for raw data saving space on EDW, help if ETL jobs on EDW taking too long. So can actually use a data lake for small data – move EDW to Hadoop, refine it, move it back to EDW. Cons: rewriting all current ETL to Hadoop, re-training

    I believe APS should be used for staging (i.e. “ELT”) in most cases, but there are some good use cases for using a Hadoop Data Lake:
     
    - Wanting to offload the data refinement to Hadoop, so the processing and space on the EDW is reduced
    - Wanting to use some Hadoop technologies/tools to refine/filter data that are not available for APS
    - Landing zone for unstructured data, as it can ingest large files quickly and provide data redundancy
    - ELT jobs on EDW are taking too long, so offload some of them to the Hadoop data lake
    - There may be cases when you want to move EDW data to Hadoop, refine it, and move it back to EDW (offload processing, need to use Hadoop tools)
    - The data lake is a good place for data that you “might” use down the road. You can land it in the data lake and have users use SQL via Polybase to look at the data and determine if it has value
  • In Gartner’s 2013 study, “Big Data Business Benefits Are Hampered by ‘Culture Clash’”, they make the argument that both approaches are needed for innovation to be successful. Oftentimes what happens in the bottoms-up approach becomes part of the top-down approach.

    The Top-down approach with the data warehouse utilizes a rigorous and formal approach to designing an enterprise wide data warehouse that can support the entire enterprise. It usually can answer questions that are backwards facing like what just happened or even answer why things happened.

    The bottoms-up approach with the data lake utilizes an exploratory and informal approach of collecting all data in a single place so that data scientists can do advanced analytics, such as leveraging Hadoop and machine learning tools. It usually can identify new opportunities, predict future outcomes, etc.

    In the ideal world, both are leveraged so that they can exploit information in the most valued way where each works together with the other to grow the business.
  • An evolution of the three previous scenarios that provides multiple options for the various technologies.  Data may be harmonized and analyzed in the data lake, or moved out to an EDW when more quality and performance is needed, or when users simply want control.  ELT is usually used instead of ETL (see Difference between ETL and ELT).  The goal of this scenario is to support any future data needs no matter what the variety, volume, or velocity of the data.

    Hub-and-spoke should be your ultimate goal.  See Why use a data lake? for more details on the various tools and technologies that can be used for the modern data warehouse.
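The ELT pattern mentioned above means the data is loaded raw first and the transform runs inside the warehouse engine afterward. A minimal sketch, with purely hypothetical staging and dimensional table names:

```sql
-- ELT: raw orders were already Landed into stg.RawOrders (the "E" and "L");
-- the "T" now happens in the warehouse engine itself via set-based SQL.
INSERT INTO dbo.FactSales (DateKey, CustomerKey, Amount)
SELECT CONVERT(INT, FORMAT(s.OrderDate, 'yyyyMMdd')),  -- derive surrogate date key
       c.CustomerKey,
       s.Quantity * s.UnitPrice                        -- compute the measure
FROM   stg.RawOrders  AS s
JOIN   dbo.DimCustomer AS c
       ON c.SourceCustomerId = s.CustomerId;
```

The point of ELT is that the heavy transform leverages the (MPP) warehouse's own processing power instead of a separate ETL server.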
  • HDInsight benefits: low cost, quick to provision

    Key goal of slide: Highlight the four main use cases for PolyBase.

    Slide talk track:
    There are four key scenarios for using PolyBase to unlock the data-lake data normally locked up in Hadoop.
    PolyBase leverages the APS MPP architecture along with optimizations like push-down computing to query data using Transact-SQL faster than using other Hadoop technologies like Hive. More importantly, you can use the Transact-SQL join syntax between Hadoop data and PDW data without having to import the data into PDW first.
    PolyBase is a great tool for archiving older or unused data in APS to less expensive storage on a Hadoop cluster. When you do need to access the data for historical purposes, you can easily join it back up with your PDW data using Transact-SQL.
    There are times when you need to share your PDW with Hadoop users and PolyBase makes it easy to copy data to a Hadoop cluster.
    Using a simple SELECT INTO statement, PolyBase makes it easy to import valuable Hadoop data into PDW without having to use external ETL processes.
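The four scenarios above map to short T-SQL patterns. This sketch assumes an external table dbo.HadoopClicks and the data source/file format are already defined; every object name is hypothetical:

```sql
-- 1) Query Hadoop data and join it to PDW data in one T-SQL statement:
SELECT c.Url, d.Region
FROM   dbo.HadoopClicks AS c
JOIN   dbo.DimCustomer  AS d ON d.CustomerKey = c.CustomerKey;

-- 2) Archive cold PDW data out to cheaper Hadoop storage (CETAS):
CREATE EXTERNAL TABLE dbo.SalesArchive
WITH (LOCATION = '/archive/sales/',
      DATA_SOURCE = HadoopLake, FILE_FORMAT = CsvFormat)
AS SELECT * FROM dbo.FactSales WHERE SaleDate < '2012-01-01';

-- 3) Sharing PDW data with Hadoop users follows the same CETAS pattern,
--    writing the result set out as files the Hadoop cluster can read.

-- 4) Import Hadoop data into PDW without an external ETL process:
SELECT * INTO dbo.Clicks FROM dbo.HadoopClicks;
```

In scenarios 1 and 4 PolyBase pushes computation down to the Hadoop cluster where it can, so the MPP engine only pulls back what the query needs.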
  • Question: Data warehouses and data lakes are now so fast, do we still need cubes?
  • Question: Do we need a relational database if we create a data lake, or can we avoid making another copy of the data? Hive LLAP, Spark SQL, and Impala are so fast, can’t we get away with just having a data lake? In other words, is the traditional data warehouse dead?


    I’m a little confused by the update to SQL DW that adds unlimited columnar storage.  What was the limit before this change?  Does this change the maximum database size of 240TB?

    The max previously was governed by the max database size of 240TB. Columnstore data is now held in blob storage, so the amount of that data we can hold is unlimited. We are still bound by the 240TB limit for page data, so indexes and heaps are still capped today.


    Speed depends on: storage type (Blob/ADLS/local), size, versions, query type, data size, driver, front-end tool, concurrency, etc.

    Storage spaces

    Azure SQL Database Managed Instance will be added to this.
  • SMP is one server where each CPU in the server shares the same memory, disk, and network controllers (scale-up). MPP means data is distributed among many independent servers running in parallel and is a shared-nothing architecture, where each server operates self-sufficiently and controls its own memory and disk (scale-out).
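In an MPP system, the scale-out described above surfaces right in the DDL: the CREATE TABLE statement declares how rows are spread across the independent nodes. A hedged sketch using the APS/Azure SQL DW distribution syntax, with illustrative table names:

```sql
-- Large fact table: rows are hash-distributed across the MPP nodes (scale-out),
-- so scans and aggregations run on all nodes in parallel.
CREATE TABLE dbo.FactSales (
    SaleKey      BIGINT,
    CustomerKey  INT,
    Amount       DECIMAL(18, 2)
)
WITH (DISTRIBUTION = HASH(CustomerKey));

-- Small dimension table: a full copy is kept on every node (replicated),
-- so joins against it need no data movement between nodes.
CREATE TABLE dbo.DimRegion (
    RegionKey INT,
    Name      NVARCHAR(50)
)
WITH (DISTRIBUTION = REPLICATE);
```

Choosing a good distribution column (one that spreads rows evenly and aligns with common join keys) is the key MPP design decision; there is no equivalent choice to make on an SMP server.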
  • http://demo.sqlmag.com/scaling-success-sql-server-2016/integrating-big-data-and-sql-server-2016

    When it comes to key BI investments, we are making it much easier to manage relational and non-relational data with PolyBase technology, which allows you to query Hadoop data and SQL Server relational data through a single T-SQL query. One of the challenges we see with Hadoop is that there are not enough people with Hadoop and MapReduce skill sets, and this technology simplifies the skill set needed to manage Hadoop data. This also works across your on-premises environment or SQL Server running in Azure.
  • https://blogs.technet.microsoft.com/msuspartner/2017/04/05/data-analytics-partners-navigating-data/
  • Question: Should SQL Database be considered in the Model & Serve blade, using it as a data mart?
  • Four Reasons to Migrate Your SQL Server Databases to the Cloud: Security, Agility, Availability, and Reliability

    Reasons not to move to the cloud:
    Security concerns (potential for compromised information, issues of privacy when data is stored in a public facility, might be more prone to outside security threats because it is high-profile, and some providers might not implement the same layers of protection you can achieve in-house)
    Lack of operational control: lack of access to servers (say you are hacked and want to get to security and system log files; if something goes wrong, you have no way of controlling how and when a response is carried out; the provider can update software, change configuration settings, and allocate resources without your input or your blessing; you must conform to the environment and standards implemented by the provider)
    Lack of ownership (an outside agency can get to data more easily in a cloud data center you don’t own than in an onsite location you do own.  Or a concern that you share a cloud data center with other companies and someone from another company could be onsite near your servers)
    Compliance restrictions
    Regulations (health, financial)
    Legal restrictions (i.e. data can’t leave your country)
    Company policies
    You may be sharing resources on your server, as well as competing for system and network resources
    Data getting stolen in-flight (i.e. from the cloud data center to the on-prem user)
  • Question: Where should we do data transformations (data lake, relational database, Databricks, etc.)?

    Question: What are the cost vs performance tradeoffs with our products? (many companies will sacrifice performance to save money)
