
Data Lake Overview


The data lake has become extremely popular, but there is still confusion on how it should be used. In this presentation I will cover common big data architectures that use the data lake, the characteristics and benefits of a data lake, and how it works in conjunction with a relational data warehouse. Then I’ll go into details on using Azure Data Lake Store Gen2 as your data lake, and various typical use cases of the data lake. As a bonus I’ll talk about how to organize a data lake and discuss the various products that can be used in a modern data warehouse.


Data Lake Overview

  1. 1. Data Lake Overview James Serra Data & AI Architect Microsoft, NYC MTC JamesSerra3@gmail.com Blog: JamesSerra.com
  2. 2. About Me  Microsoft, Big Data Evangelist  In IT for 30 years, worked on many BI and DW projects  Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer  Been perm employee, contractor, consultant, business owner  Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference  Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions  Blog at JamesSerra.com  Former SQL Server MVP  Author of book “Reporting with Microsoft SQL Server 2012”
  3. 3. Agenda  Big Data Architectures  Why data lakes?  Top-down vs Bottom-up  Data lake defined  Creating ADLS Gen2  Data Lake Use Cases
  4. 4. ? ? ? ? Big Data Architectures
  5. 5. Enterprise data warehouse augmentation • Seen when EDW has been in existence a while and EDW can’t handle new data • Data hub, not data lake • Cons: not offloading EDW work, can’t use existing tools, difficulty joining data in data hub with EDW
  6. 6. Data hub plus EDW • Data hub is used as temporary staging and refining, no reporting • Cons: data hub is temporary, no reporting/analyzing done with the data hub (temporary)
  7. 7. All-in-one Is the traditional data warehouse dead? https://www.jamesserra.com/archive/2017/12/is-the-traditional-data-warehouse-dead/ • Data hub is total solution, no EDW • Cons: queries are slower, new training for reporting tools, difficulty understanding data, security limitations
  8. 8. Modern Data Warehouse • Evolution of three previous scenarios • Ultimate goal • Supports future data needs • Data harmonized and analyzed in the data lake or moved to EDW for more quality and performance
  9. 9. INGEST STORE PREP & TRAIN MODEL & SERVE M O D E R N D A T A W A R E H O U S E Azure Data Lake Store Gen2 Logs (unstructured) Azure Data Factory Azure Databricks Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs. Media (unstructured) Files (unstructured) PolyBase Business/custom apps (structured) Azure SQL Data Warehouse Azure Analysis Services Power BI
  10. 10. Advanced Analytics Social LOB Graph IoT Image CRM INGEST STORE PREP MODEL & SERVE (& store) Data orchestration and monitoring Big data store Transform & Clean Data warehouse AI BI + Reporting Azure Data Factory SSIS Azure Data Lake Storage Gen2 Blob Storage SQL Server 2019 Big Data Cluster Azure Databricks Azure HDInsight PolyBase & Stored Procedures Power BI Dataflows Azure SQL Data Warehouse Azure Analysis Services SQL Database (Single, MI, HyperScale, Serverless) SQL Server in a VM Cosmos DB Power BI Aggregations
  11. 11. ? ? ? ? Why data lakes?
  12. 12. ETL pipeline Dedicated ETL tools (e.g. SSIS) Defined schema Queries Results Relational LOB Applications Traditional business analytics process: 1. Start with end-user requirements to identify desired reports and analysis 2. Define corresponding database schema and queries 3. Identify the required data sources 4. Create an Extract-Transform-Load (ETL) pipeline to extract required data (curation) and transform it to the target schema (‘schema-on-write’) 5. Create reports; analyze data. All data not immediately required is discarded or archived
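As a minimal sketch of the schema-on-write steps above (all source and column names are hypothetical, not from the deck): the target schema is fixed up front, and anything that does not fit it is discarded before the load.

```python
# Illustrative schema-on-write ETL: the target schema is defined before any
# data is collected, and only fields the reports need survive the load.

def extract(source_rows):
    # Pull raw records from a line-of-business source (hypothetical shape).
    return list(source_rows)

def transform(raw_rows):
    # Conform each record to the target schema; everything else is dropped.
    out = []
    for row in raw_rows:
        try:
            out.append({
                "customer_id": int(row["cust"]),
                "order_date": row["date"],
                "amount": float(row["total"]),
            })
        except (KeyError, ValueError):
            continue  # rows that don't fit the schema are discarded
    return out

def load(warehouse, rows):
    warehouse.extend(rows)

warehouse = []
source = [
    {"cust": "101", "date": "2019-05-01", "total": "19.99", "clicks": "..."},
    {"cust": "bad", "date": "2019-05-02", "total": "5.00"},
]
load(warehouse, transform(extract(source)))
print(len(warehouse))  # only the conforming row is kept
```

Note how the `clicks` field and the malformed row simply vanish — exactly the "data not immediately required is discarded" behavior the slide describes.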
  13. 13. Harness the growing and changing nature of data. Need to collect any data: structured, unstructured, streaming. Challenge is combining transactional data stored in relational databases with less structured data. Big Data = All Data. “Get the right information to the right people at the right time in the right format”
  14. 14. The three V’s
  15. 15. Store indefinitely Analyze See results Gather data from all sources Iterate New big data thinking: All data has value Use a data lake: All data has potential value Data hoarding No defined schema—stored in native format Schema is imposed and transformations are done at query time (schema-on-read). Apps and users interpret the data as they see fit
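The schema-on-read idea on this slide can be sketched as follows (a hedged illustration with hypothetical event shapes, not code from the deck): data lands in native format with no upfront transformation, and each consumer imposes its own schema at query time.

```python
import json

# Illustrative schema-on-read: events land in the lake in native format;
# the schema is imposed only when a consumer reads the data.

lake = []  # stands in for raw files in the data lake

def ingest(raw_bytes):
    lake.append(raw_bytes)  # no upfront transformation or aggregation

def query_temperatures():
    # One consumer's view of the same raw data, applied at read time.
    readings = []
    for blob in lake:
        event = json.loads(blob)
        if event.get("type") == "temperature":
            readings.append(float(event["value"]))
    return readings

ingest(b'{"type": "temperature", "value": "21.5", "unit": "C"}')
ingest(b'{"type": "door_open", "sensor": "dock-3"}')
print(query_temperatures())  # [21.5]
```

A different consumer could read the very same two blobs and extract door events instead — the raw data is kept, so it can serve many analytic needs later.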
  16. 16. ? ? ? ? Top-down vs Bottom-up
  17. 17. Observation Pattern Theory Hypothesis What will happen? How can we make it happen? Predictive Analytics Prescriptive Analytics What happened? Why did it happen? Descriptive Analytics Diagnostic Analytics Confirmation Theory Hypothesis Observation Two Approaches to getting value out of data: Top-Down + Bottoms-Up
  18. 18. Data Warehousing Uses A Top-Down Approach: Understand Corporate Strategy → Gather Requirements (Business Requirements, Technical Requirements) → Design (Dimension Modelling, ETL Design, Reporting & Analytics Design, Setup Infrastructure) → Implement Data Warehouse (Physical Design, ETL Development, Reporting & Analytics Development, Install and Tune). Data sources
  19. 19. The “data lake” Uses A Bottoms-Up Approach Ingest all data regardless of requirements Store all data in native format without schema definition Do analysis Using analytic engines like Hadoop Interactive queries Batch queries Machine Learning Data warehouse Real-time analytics Devices
  20. 20. Data Lake + Data Warehouse Better Together Data sources What happened? Descriptive Analytics Diagnostic Analytics Why did it happen? What will happen? Predictive Analytics Prescriptive Analytics How can we make it happen?
  21. 21. ? ? ? ? Data lake defined
  22. 22. Exactly what is a data lake? A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed. • Inexpensively store unlimited data • Centralized place for multiple subjects (single version of the truth) • Collect all data “just in case” (data hoarding) • Easy integration of differently-structured data • Store data with no modeling – “Schema on read” • Complements enterprise data warehouse (EDW) • Frees up expensive EDW resources for queries instead of using EDW resources for transformations (avoiding user contention) • Hadoop cluster offers faster ETL processing over SMP solutions • Quick user access to data for power users/data scientists (allowing for faster ROI) • Data exploration to see if data valuable before writing ETL and schema for relational database, or use for one-time report • Allows use of Hadoop tools such as ETL and extreme analytics • Place to land IoT streaming data • On-line archive or backup for data warehouse data • With Hadoop/ADLS, high availability and disaster recovery built in • Keep raw data so don’t have to go back to source if need to re-run • Allows for data to be used many times for different analytic needs and use cases • Cost savings and faster transformations: storage tiers with lifecycle management; separation of storage and compute resources allowing multiple instances of different sizes working with the same data simultaneously vs scaling data warehouse; low-cost storage for raw data saving space on the EDW • Extreme performance for transformations by having multiple compute options each accessing different folders containing data • The ability for an end-user or product to easily access the data from any location
  23. 23. Current state of a data warehouse Traditional Approaches CRM ERP OLTP LOB DATA SOURCES ETL DATA WAREHOUSE Star schemas, views, other read-optimized structures BI AND ANALYTICS Emailed, centrally stored Excel reports and dashboards Well manicured, often relational sources Known and expected data volume and formats Little to no change Complex, rigid transformations Required extensive monitoring Transformed historical into read structures Flat, canned or multi-dimensional access to historical data Many reports, multiple versions of the truth 24 to 48h delay MONITORING AND TELEMETRY
  24. 24. Current state of a data warehouse Traditional Approaches CRM ERP OLTP LOB DATA SOURCES ETL DATA WAREHOUSE Star schemas, views, other read-optimized structures BI AND ANALYTICS Emailed, centrally stored Excel reports and dashboards Increase in variety of data sources Increase in data volume Increase in types of data Pressure on the ingestion engine Complex, rigid transformations can no longer keep pace Monitoring is abandoned Delay in data, inability to transform volumes, or react to new sources Repair, adjust and redesign ETL Reports become invalid or unusable Delay in preserved reports increases Users begin to “innovate” to relieve starvation MONITORING AND TELEMETRY INCREASING DATA VOLUME NON-RELATIONAL DATA INCREASE IN TIME STALE REPORTING
  25. 25. Data Lake Transformation (ELT not ETL) New Approaches All data sources are considered Leverages the power of on-prem technologies and the cloud for storage and capture Native formats, streaming data, big data Extract and load, no/minimal transform Storage of data in near-native format Orchestration becomes possible Streaming data accommodation becomes possible Refineries transform data on read Produce curated data sets to integrate with traditional warehouses Users discover published data sets/services using familiar tools CRM ERP OLTP LOB DATA SOURCES FUTURE DATA SOURCES NON-RELATIONAL DATA EXTRACT AND LOAD DATA LAKE DATA REFINERY PROCESS (TRANSFORM ON READ) Transform relevant data into data sets BI AND ANALYTICS Discover and consume predictive analytics, data sets and other reports DATA WAREHOUSE Star schemas, views, other read-optimized structures
  26. 26. Data Analysis Paradigm Shift OLD WAY: Structure -> Ingest -> Analyze NEW WAY: Ingest -> Analyze -> Structure
  27. 27. Needs data governance so your data lake does not turn into a data swamp!
  28. 28. Objectives  Plan the structure based on optimal data retrieval  Avoid a chaotic, unorganized data swamp Data Retention Policy Temporary data Permanent data Applicable period (ex: project lifetime) etc… Business Impact / Criticality High (HBI) Medium (MBI) Low (LBI) etc… Confidential Classification Public information Internal use only Supplier/partner confidential Personally identifiable information (PII) Sensitive – financial Sensitive – intellectual property etc… Probability of Data Access Recent/current data Historical data etc… Owner / Steward / SME Subject Area Security Boundaries Department Business unit etc… Time Partitioning Year/Month/Day/Hour/Minute Downstream App/Purpose Common ways to organize the data: Special thanks to: Melissa Coates CoatesDataStrategies.com
  29. 29. Example 1 Pros: Subject area at top level, organization-wide Partitioned by time Cons: No obvious security or organizational boundaries Raw Data Zone Subject Area Data Source Object Date Loaded File(s) ---------------------------------------------------- Sales Salesforce CustomerContacts 2016 12 01 CustContact_2016_12_01.txt Curated Data Zone Purpose Type Snapshot Date File(s) ----------------------------------- Sales Trending Analysis Summarized 2016_12_01 SalesTrend_2016_12_01.txt Organizing a Data Lake Thanks to Melissa Coates, www.CoatesDataStrategies.com
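The raw-zone layout in Example 1 can be sketched as a small path-building helper. The zone, subject area, source, and date-partition names below follow the slide; the function name itself is just illustrative.

```python
from datetime import date
from pathlib import PurePosixPath

# Sketch of the raw-zone convention from the slide:
# Raw Data Zone / Subject Area / Data Source / Object / yyyy / mm / dd / file

def raw_zone_path(subject, source, obj, loaded, filename):
    return PurePosixPath(
        "Raw Data Zone", subject, source, obj,
        f"{loaded:%Y}", f"{loaded:%m}", f"{loaded:%d}", filename,
    )

p = raw_zone_path("Sales", "Salesforce", "CustomerContacts",
                  date(2016, 12, 1), "CustContact_2016_12_01.txt")
print(p)
# Raw Data Zone/Sales/Salesforce/CustomerContacts/2016/12/01/CustContact_2016_12_01.txt
```

Encoding the convention in one helper keeps every ingestion job writing to the same partition scheme, which is what prevents the "chaotic, unorganized data swamp" the earlier slide warns about.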
  30. 30. Data Warehouse Serving, Security & Compliance • Business people • Low latency • Complex joins • Interactive ad-hoc query • High number of users • Additional security • Large support for tools • Dashboards • Easily create reports (Self-service BI) • Know questions
  31. 31. ? ? ? ? Creating ADLS Gen2
  32. 32. A “no-compromises” Data Lake: secure, performant, massively-scalable Data Lake storage that brings the cost and scale profile of object storage together with the performance and analytics feature set of data lake storage. Azure Data Lake Storage Gen2 — MANAGEABLE, SCALABLE, FAST, SECURE, COST EFFECTIVE, INTEGRATION READY:  No limits on data store size  Global footprint (50 regions)  Optimized for Spark and Hadoop Analytic Engines  Tightly integrated with Azure end to end analytics solutions  Automated Lifecycle Policy Management  Object Level tiering  Support for fine-grained ACLs, protecting data at the file and folder level  Multi-layered protection via at-rest Storage Service encryption and Azure Active Directory integration  Atomic file operations mean jobs complete faster  Object store pricing levels  File system operations minimize transactions required for job completion
  33. 33. Blob Storage (large partner ecosystem; global scale – all 50 regions; durability options; tiered – Hot/Cool/Archive; cost efficient) + Data Lake Store (built for Hadoop; hierarchical namespace; ACLs, AAD and RBAC; performance tuned for big data; very high scale capacity and throughput) = Azure Data Lake Storage Gen2, which combines the full feature set of both.
  34. 34. Missing blob storage features: - Archive and Premium tier - Soft Delete - Snapshots - Some features in preview https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-known-issues Missing ADLS Gen1 features: - Microsoft product support: ADC, Excel, AAS - 3rd-party products: Informatica, Attunity, Alteryx - Some features in preview https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-upgrade
  35. 35. Azure Data Lake Store – Distributed File System ADLS File Files of any size can be stored because ADLS is a distributed system in which file contents are divided up across backend storage nodes. A read operation on the file is also parallelized across the nodes. Blocks are also replicated for fault tolerance. Data Node 1 Block Data Node 2 Block Data Node 3 Block Data Node 4 Block The ideal file size in ADLS is 256MB – 2GB. Many very tiny files introduce significant overhead, which reduces performance. This is a well-known issue with storing data in HDFS. Techniques: • Append-only data streams Thanks to Melissa Coates, www.CoatesDataStrategies.com
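One common response to the small-files problem the slide mentions is periodic compaction: merging many tiny files into fewer files near the 256 MB floor. The greedy binning below is my own illustration of that idea, not something from the deck.

```python
# Illustrative compaction planner for the small-files problem: greedily bin
# small files into batches that each merge into one output file of roughly
# the slide's recommended 256 MB minimum.

TARGET = 256 * 1024 * 1024  # 256 MB

def plan_compaction(file_sizes):
    batches, current, current_size = [], [], 0
    for size in file_sizes:
        current.append(size)
        current_size += size
        if current_size >= TARGET:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)  # leftover files form a final, smaller batch
    return batches

# 10 files of 64 MB each compact down to 3 output files instead of 10.
sizes = [64 * 1024 * 1024] * 10
print([len(b) for b in plan_compaction(sizes)])  # [4, 4, 2]
```

Fewer, larger files mean fewer metadata operations per job, which is exactly why the overhead of "many very tiny files" hurts ADLS and HDFS performance.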
  36. 36. LRS Multiple replicas across a datacenter Protect against disk, node, rack failures Write is ack’d when all replicas are committed Superior to dual-parity RAID 11 9s of durability SLA: 99.9% GRS Multiple replicas across each of 2 regions Protects against major regional disasters Asynchronous to secondary 16 9s of durability SLA: 99.9% RA-GRS GRS + Read access to secondary Separate secondary endpoint RPO delay to secondary can be queried SLA: 99.99% (read), 99.9% (write) Zone 1 ZRS Replicas across 3 Zones Protect against disk, node, rack and zone failures Synchronous writes to all 3 zones 12 9s of durability Available in 8 regions SLA: 99.9% Zone 2 Zone 3
  37. 37. File Sync • Windows Srv <-> Azure • Local caching • With offline (Databox) can 'sync' remainder Fuse • Mount blobs as local FS • Commit on write • Linux Site Replication • On premise & cloud • Windows, Linux • Physical, virtual • Hyper-V, VMWare Network Acceleration • Aspera • Signiant AZCopy • Throughput +30% • S3 to Azure Blobs • Sync to cloud • Hi Latency 10-100% NetApp • CloudSync • SnapMirror • SnapVault Data Factory • On premise & cloud sources • Structured & unstructured • Over 60 connectors • UI design data flow Partners • Peer Global File Service • Talon FAST • Zerto • … Offline • Data Box • Data Box Heavy • Data Box Disk • Disk Import / Export Fast Data Transfer microsoft.com/en-us/garage/profiles/fast-data-transfer/
  38. 38. Data Box Heavy PREVIEW • Capacity: 1 PB • Weight 500+ lbs • Secure, ruggedized appliance • Same service as Data Box, but targeted to petabyte- sized datasets. Data Box Gateway • Virtual device provisioned in your hypervisor • Supports storage gateway, SMB, NFS, Azure blob, files • Virtual network transfer appliance (VM), runs on your choice of hardware. Data Box Edge • Local Cache Capacity: ~12 TB • Includes Data Box Gateway and Azure IoT Edge. • Data Box Edge manages uploads to Azure and can pre-process data prior to upload. Data Box • Capacity: 100 TB • Weight: ~50 lbs • Secure, ruggedized appliance • Data Box enables bulk migration to Azure when network isn’t an option. Data Box Disk • Capacity: 8TB ea.; 40TB/order • Secure, ruggedized USB drives orderable in packs of 5 (up to 40TB). • Perfect for projects that require a smaller form factor, e.g., autonomous vehicles. Order Fill UploadSend Return Cloud to Edge Edge to Cloud Pre-processing ML Inferencing Network Data Transfer Edge Compute Offline Data Transfer Online Data Transfer
  39. 39. https://azure.microsoft.com/en-us/features/storage-explorer/
  40. 40. Caching Layer (Avere tech) Extra Hot Tier - Premium (SSD + NVME) Hot Tier (HDD) Cool Tier (HDD) Cooler Tier (Pelican) Archive Tier (Tape) Deep Storage Tier (Glass, DNA, etc.) Analytics Engines (Hadoop, Spark, SCOPE …) High Performance Compute AI / ML Current Future Edge Azure File Sync Azure Backup Data Box Data Box Edge Azure Stack REST HDFS NFS SMB … Automatic Lifecycle Management Priority Retrieval [preview] 1hr vs 15 Avere FXT
  41. 41. ? ? ? ? Data Lake Use Cases
  42. 42. Data Lake Use Cases Social Media Data Devices & Sensors Data Lake  Preparatory file storage for multi-structured data  Exploratory analysis + POCs to determine value of new data types & sources  Affords additional time for longer-term planning while accumulating data or handling an influx of data Raw Data Web Logs Images, Audio, Video Spatial, GPS Exploratory Analysis Data Science Sandbox Ingestion of New File Types Thanks to Melissa Coates, www.CoatesDataStrategies.com
  43. 43. Data Lake Use Cases Social Media Data Devices & Sensors Data Lake  Sandbox solutions for initial data prep, experimentation, and analysis  Migrate from proof of concept to operationalized solution  Integrate with open source projects such as Hive, Pig, Spark, Storm, etc.  Big data clusters  SQL-on-Hadoop solutions Raw Data Spatial, GPS Exploratory Analysis Data Science Sandbox Hadoop Advanced Analytics Curated Data Analytics & Reporting Machine Learning Data Science Experimentation | Hadoop Integration Flat Files Thanks to Melissa Coates, www.CoatesDataStrategies.com
  44. 44. Data Lake Use Cases Social Media Data Devices & Sensors Data Lake Data Warehouse Analytics & Reporting Cubes & Semantic Models Corporate Data Cloud Systems Third Party Data, Flat Files  ELT strategy  Reduce storage needs in relational platform by using the data lake as landing area  Practical use for data stored in the data lake  Potentially also handle transformations in the data lake Data Processing Jobs Raw Data: Staging Area Data Warehouse Staging Area Thanks to Melissa Coates, www.CoatesDataStrategies.com
  45. 45. Data Lake Use Cases Social Media Data Devices & Sensors Data Lake Data Warehouse Analytics & Reporting Cubes & Semantic Models Corporate Data Cloud Systems Third Party Data, Flat Files  Grow around existing DW  Aged data available for querying when needed  Complement to the DW via data virtualization  Federated queries to access current data (relational DB) + archive (data lake) Raw Data: Staging Area Archived Data Integration with DW | Data Archival | Centralization Thanks to Melissa Coates, www.CoatesDataStrategies.com
  46. 46. Data Lake Use Cases Social Media Data Devices & Sensors Speed Layer Data Warehouse Analytics & Reporting Cubes & Semantic Models  Support for low-latency, high-velocity data in near real time  Support for batch- oriented operations Batch Layer Serving Layer Data Ingestion Data Processing Data Lake: Raw Data Corporate Data Data Lake: Curated Data Lambda Architecture Near Real-Time Thanks to Melissa Coates, www.CoatesDataStrategies.com
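The lambda pattern on this slide can be reduced to a very small sketch (the views and sensor names below are hypothetical): the batch layer holds curated history, the speed layer holds what has arrived since the last batch run, and the serving layer answers queries by merging both.

```python
# Sketch of the lambda architecture's serving layer: merge the batch view
# (curated, recomputed periodically) with the speed view (near real time).

batch_view = {"dock-3": 120}             # counts from the nightly batch job
speed_view = {"dock-3": 4, "dock-7": 1}  # increments since that job ran

def serve(sensor):
    # Batch result plus whatever the speed layer has seen since the last run.
    return batch_view.get(sensor, 0) + speed_view.get(sensor, 0)

print(serve("dock-3"))  # 124
print(serve("dock-7"))  # 1
```

When the next batch job runs, it recomputes `batch_view` from the raw data in the lake and the speed view resets — the batch layer's support for "batch-oriented operations" and the speed layer's "low-latency, high-velocity data" each stay simple because neither has to do the other's job.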
  47. 47. Q & A James Serra, Big Data Evangelist Email me at: JamesSerra3@gmail.com Follow me at: @JamesSerra Link to me at: www.linkedin.com/in/JamesSerra Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)

Editor's notes

  • The data lake has become extremely popular, but there is still confusion on how it should be used. In this presentation I will cover common big data architectures that use the data lake, the characteristics and benefits of a data lake, and how it works in conjunction with a relational data warehouse. Then I’ll go into details on using Azure Data Lake Store Gen2 as your data lake, and various typical use cases of the data lake. As a bonus I’ll talk about how to organize a data lake and discuss the various products that can be used in a modern data warehouse.
  • Fluff, but point is I bring real work experience to the session
  • Removed
    - SMP vs MPP
  • This scenario uses an enterprise data warehouse (EDW) built on a RDBMS, but will extract data from the EDW and load it into a big data hub along with data from other sources that are deemed not cost-effective to move into the EDW (usually high-volume data or cold data). Some data enrichment is usually done in the data hub. This data hub can then be queried, but primary analytics remain with the EDW. The data hub is usually built on Hadoop or NoSQL. This can save costs since storage using Hadoop or NoSQL is much cheaper than an EDW. Plus, this can speed up the development of reports since the data in Hadoop or NoSQL can be used right away instead of waiting for an IT person to write the ETL and create the schemas to ingest the data into the EDW. Another benefit is it can support data growth faster as it is easy to expand storage on a Hadoop/NoSQL solution instead of on a SAN with an EDW solution. Finally, it can help by reducing the number of queries on the EDW.

    This scenario is most common when an EDW has been in existence for a while and users are requesting data that the EDW cannot handle because of space, performance, and data loading times.

    The challenges to this approach are that you might not be able to use your existing tools to query the data hub, and that the data in the hub may be difficult to understand and join, and may not be completely clean. It also does not use the big data repository as a cleaning area or offload work from the EDW.

  • The data hub is used as a data staging and extreme-scale data transformation platform, but long-term persistence and analytics is performed in the EDW.  Hadoop or NoSQL is used to refine the data in the data hub.  Once refined, the data is copied to the EDW and then deleted from the data hub.

    This will lower the cost of data capture, provide scalable data refinement, and provide fast queries via the EDW.  It also offloads the data refinement from the EDW.

    Cons: data hub is temporary, no reporting or analyzing done with it
  • A distributed data system is implemented for long-term, high-detail big data persistence in the data hub and analytics without employing an EDW. Low-level code is written or big data packages are added that integrate directly with the distributed data store for extreme-scale operations and analytics.

    The distributed data hub is usually created with Hadoop, HBase, Cassandra, or MongoDB.  BI tools specifically integrated with or designed for distributed data access and manipulation are needed.  Data operations either use BI tools that provide NoSQL capability or low-level code is required (e.g., MapReduce or Pig script).

    The disadvantages of this scenario are reports and queries can have longer latency, new reporting tools require training which could lead to lower adoption, and the difficulty of providing governance and structure on top of a non-RDBMS solution.
  • An evolution of the three previous scenarios that provides multiple options for the various technologies. Data may be harmonized and analyzed in the data lake or moved out to an EDW when more quality and performance are needed, or when users simply want control. ELT is usually used instead of ETL (see Difference between ETL and ELT). The goal of this scenario is to support any future data needs no matter what the variety, volume, or velocity of the data.

    Hub-and-spoke should be your ultimate goal.  See Why use a data lake? for more details on the various tools and technologies that can be used for the modern data warehouse.
  • 10
  • Key Points:
    Businesses can use new data streams to gain a competitive advantage.
    Microsoft is uniquely equipped to help you manage the growing volume and variety of data: structured, unstructured, and streaming.

    Talk Track:
    Does it not seem like every day there is a new kind of data that we need to understand?
    New data types continue to expand—we need to be prepared to collect that data so that the organization can then go do something with it.
    Structured data, the type of data we have been working with for years, continues to accelerate. Think how many transactions are occurring across your business.
    Unstructured data, the typical source of all our big data, takes many forms and originates from various places across the web including social.
    Streaming data is the data at the heart of the Internet of Things revolution. Just think about how many things in your organization are smart or instrumented and generating data every second.
    All of this means that data volumes are growing and bringing new capacity challenges. You are also dealing with an enormous opportunity, taking all of this data and putting it to work. In order to take advantage of all this data, you first need a platform that enables you to collect any data—no matter the size or type. The Microsoft data platform is uniquely complete and can help you collect any data using a flexible approach:
    Collecting data on-premises with SQL Server
    SQL Server can help you collect and manage structured, unstructured, and streaming data to power all your workloads: OLTP, BI, and Data Warehousing
    With new in-memory capabilities that are built into SQL Server 2014, you get the benefit of breakthrough speed with your existing hardware and without having to rewrite your apps.
    If you’ve been considering the cloud, SQL Server provides an on-ramp to help you get started. Using the wizards built into SQL Server Management Studio, extending to the cloud by combining SQL and Microsoft Azure is simple.
    Capture new data types using the power and flexibility of the Microsoft Azure Cloud
    Azure is well equipped to provide the flexibility you need to collect and manage any data in the cloud in a way that meets the needs of your business.
    Big data in Azure: HDInsight: an Apache Hadoop-based analytics solution that allows cluster deployment in minutes, scale up or down as needed, and insights through familiar BI tools.
    SQL Databases: managed relational SQL Database-as-a-service that offers business-ready capabilities built on SQL Server technology.
    Blobs: a cloud storage solution offering the simplest way to store large amounts of unstructured text or binary data, such as video, audio, and images.
    Tables: a NoSQL key/value storage solution that provides simple access to data at a lower cost for applications that do not need robust querying capabilities.
    Intelligent Systems Service: cloud service that helps enterprises embrace the Internet of Things by securely connecting, managing, and capturing machine-generated data from a variety of sensors and devices to drive improvements in operations and tap into new business opportunities.
    Machine Learning: if you’re looking to anticipate business challenges or opportunities, or perhaps expand your data practice into data science, Azure’s new Machine Learning service—cloud-based predictive analytics— can help. ML Studio is a fully-managed cloud service that enables data scientists and developers to efficiently embed predictive analytics into their applications, helping organizations use massive data sets and bring all the benefits of the cloud to machine learning.
    Document DB: a fully managed, highly scalable, NoSQL document database service
    Azure Stream Analytics: real-time event processing engine that helps uncover insights from devices, sensors, infrastructure, applications, and data
    Azure Data Factory: enables information production by orchestrating and managing diverse data
    Azure Event Hubs: a scalable service for collecting data from millions of “things” in seconds
    Microsoft Analytics Platform System:
    In the past, to provide users with reliable, trustworthy information, enterprises gathered relational and transactional data in a single data warehouse.
    But this traditional data warehouse is under pressure, hitting limits amidst massive change.
    Data volumes are projected to grow tenfold over the next five years. End users want real-time responses and insights.
    They want to use non-relational data, which now constitutes 85 percent of data growth. They want access to “cloud-born” data, data that was created from growing cloud IT investments.
    Your enterprise can only cope with these shifts with a modern data warehouse—the Microsoft Analytics Platform System is the answer.
    The Analytics Platform System brings Microsoft’s massively parallel processing (MPP) data warehouse technology—the SQL Server Parallel Data Warehouse (PDW), together with HDInsight, Microsoft’s 100 percent Apache Hadoop distribution—and delivers it as a turnkey appliance.
    Now you can collect relational and non-relational data in one appliance.
    You can have seamless integration of the relational data warehouse and Hadoop with PolyBase.
     
    All of these options give you the flexibility to get the most out of your existing data capture investments while providing a path to a more efficient and optimized data environment that is ready to support new data types.
  • All data has immediate or potential value
    This leads to data hoarding—all data is stored indefinitely
    With an unknown future, there is no defined schema. Data is prepared and stored in native format; No upfront transformation or aggregation
    Schema is imposed and transformations are done at query time (schema-on-read). Applications and users interpret the data as they see fit.
  • Top down starts with descriptive analytics and progresses to prescriptive analytics. Know the questions to ask. Lots of upfront work to get data to where you can use it
    Bottoms up starts with predictive analytics. Don’t know the questions to ask. Little work needs to be done to start using data


    There are two approaches to doing information management for analytics:
    Top-down (deductive approach). This is where analytics is done starting with a clear understanding of corporate strategy, and theories and hypotheses are formed up front. The right data model is then designed and implemented prior to any data collection. Oftentimes, the top-down approach is good for descriptive and diagnostic analytics: what happened in the past and why did it happen?
    Bottom-up (inductive approach). This is the approach where data is collected up front before any theories and hypotheses are made. All data is kept so that patterns and conclusions can be derived from the data itself. This approach allows for more advanced analytics, such as predictive or prescriptive analytics: what will happen and/or how can we make it happen?

    In Gartner’s 2013 study, “Big Data Business Benefits Are Hampered by ‘Culture Clash’”, they make the argument that both approaches are needed for innovation to be successful. Oftentimes what happens in the bottom-up approach becomes part of the top-down approach.
  • https://www.jamesserra.com/archive/2017/06/data-lake-details/

    https://blog.pythian.com/reduce-costs-by-adding-a-data-lake-to-your-cloud-data-warehouse/

    Also called bit bucket, staging area, landing zone or enterprise data hub (Cloudera)

    http://www.jamesserra.com/archive/2014/05/hadoop-and-data-warehouses/

    http://www.jamesserra.com/archive/2014/12/the-modern-data-warehouse/

    http://adtmag.com/articles/2014/07/28/gartner-warns-on-data-lakes.aspx

    http://intellyx.com/2015/01/30/make-sure-your-data-lake-is-both-just-in-case-and-just-in-time/

    http://www.blue-granite.com/blog/bid/402596/Top-Five-Differences-between-Data-Lakes-and-Data-Warehouses

    http://www.martinsights.com/?p=1088

    http://data-informed.com/hadoop-vs-data-warehouse-comparing-apples-oranges/

    http://www.martinsights.com/?p=1082

    http://www.martinsights.com/?p=1094

    http://www.martinsights.com/?p=1102
  • Why move relational data to the data lake? Offload processing to refine data and free up the EDW, use low-cost storage for raw data to save space on the EDW, and help when ETL jobs on the EDW are taking too long. So you can actually use a data lake for small data: move EDW data to Hadoop, refine it, and move it back to the EDW. Cons: rewriting all current ETL for Hadoop, and re-training

    I believe APS should be used for staging (i.e. “ELT”) in most cases, but there are some good use cases for using a Hadoop Data Lake:
     
    - Wanting to offload the data refinement to Hadoop, so the processing and space on the EDW is reduced
    - Wanting to use some Hadoop technologies/tools to refine/filter data that are not available for APS
    - Landing zone for unstructured data, as it can ingest large files quickly and provide data redundancy
    - ELT jobs on EDW are taking too long, so offload some of them to the Hadoop data lake
    - There may be cases when you want to move EDW data to Hadoop, refine it, and move it back to EDW (offload processing, need to use Hadoop tools)
    - The data lake is a good place for data that you “might” use down the road. You can land it in the data lake and have users use SQL via Polybase to look at the data and determine if it has value
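    The "move EDW data out, refine it, and move it back" pattern above can be sketched as a simple refinement step. This is a hedged illustration only: the field names (customer_id, email) and cleaning rules are assumptions, not from the slides, and a real offload job would run in Hadoop/Spark rather than plain Python.

    ```python
    # Sketch of an offloaded refinement step: land raw rows in the lake,
    # clean them there, and keep only refined rows to load back into the EDW.

    def refine(raw_rows):
        """Drop rows missing a key and normalize email casing/whitespace."""
        refined = []
        for row in raw_rows:
            if not row.get("customer_id"):
                continue  # skip (or quarantine) incomplete records
            row = dict(row, email=row.get("email", "").strip().lower())
            refined.append(row)
        return refined

    landed = [
        {"customer_id": 1, "email": "  Alice@Example.COM "},
        {"customer_id": None, "email": "bad@row"},
    ]
    print(refine(landed))  # [{'customer_id': 1, 'email': 'alice@example.com'}]
    ```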
  • https://www.sqlchick.com/entries/2017/12/30/zones-in-a-data-lake
    https://www.sqlchick.com/entries/2016/7/31/data-lake-use-cases-and-planning

    Question: Do you see many companies building data lakes?

    Raw: Raw events are stored for historical reference. Also called the staging layer or landing area
    Cleansed: Raw events are transformed (cleaned and mastered) into directly consumable data sets. The aim is to standardize how files are stored in terms of encoding, format, data types, and content (e.g. strings). Also called the conformed layer
    Application: Business logic is applied to the cleansed data to produce data ready to be consumed by applications (e.g. a DW application, an advanced analytics process). This layer goes by many other names: workspace, trusted, gold, secure, production ready, governed, presentation
    Sandbox: An optional layer to “play” in. Also called the exploration layer or data science workspace
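    The zones above often map to top-level folders in the lake. A minimal sketch of such a layout, assuming a common source/dataset/date convention (the names and date partitioning are illustrative; real layouts vary by team):

    ```python
    # Build conventional lake paths for the zones described above,
    # e.g. raw/sales/orders/2019/01/05
    import datetime
    from pathlib import PurePosixPath

    ZONES = ["raw", "cleansed", "application", "sandbox"]

    def zone_path(zone, source, dataset, date):
        """Return a zone-prefixed path, partitioned by date."""
        assert zone in ZONES, f"unknown zone: {zone}"
        return str(PurePosixPath(zone) / source / dataset / date.strftime("%Y/%m/%d"))

    print(zone_path("raw", "sales", "orders", datetime.date(2019, 1, 5)))
    # raw/sales/orders/2019/01/05
    ```

    Keeping the zone as the first path segment makes it easy to apply different security and retention policies per zone.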
  • Thanks to Melissa Coates, www.CoatesDataStrategies.com

  • Gen1 vs Gen2:
    https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-upgrade
  • https://azure.microsoft.com/en-us/support/legal/sla/storage/v1_3/
  • Transfer options:

    https://www.microsoft.com/en-us/garage/profiles/fast-data-transfer/

    Azure data box
  • https://azure.microsoft.com/en-us/services/databox/
