Discovering the Data - Data Warehouses

This document presents an introduction to data warehouses and new trends in the field. It briefly explains key data warehouse concepts such as dimensions, fact tables, and ETL. It also describes emerging technologies such as Azure SQL Data Warehouse and Azure Data Lake, which enable self-service BI and the analysis of unstructured data. The document closes by inviting questions and comments.

Data Warehouses
Julián Castiblanco P
ITPROS-DC community leader
https://www.facebook.com/ITProsDC
http://www.meetup.com/ITPROS-DC/

Julián Castiblanco P.
http://www.azurecloud.com.co/
http://julycastiblanco.blogspot.com.co/
co.linkedin.com/juliancastiblancop
@jcastiblancop
Julian_castiblancop@hotmail.com
Database Consultant- Synergy TPC
MVP Data Platform
Member of the PASS ITPros-DC Chapter

Agenda
• Some basic theory
• What ETL packages are
• New trends

Design Differences
Data warehouses | Transactional systems
ER Diagram

Surrogate Keys vs. Business Keys
CustomerKey  CustomerAltKey  FirstName  LastName
1            1002            Amy        Alberts
2            1005            Neil       Black
CustomerKey: surrogate key. CustomerAltKey: business key. FirstName, LastName: additional dimension attributes.
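
A minimal sketch of the idea in the table above, assuming a hypothetical in-memory dimension (the function `load_customer` and its data structures are illustrative and not part of the slides): the warehouse generates its own sequential CustomerKey, and the identifier from the source system is stored as CustomerAltKey.

```python
from itertools import count

# Hypothetical in-memory customer dimension; all names are illustrative.
_next_key = count(start=1)   # surrogate key generator owned by the warehouse
dim_customer = []            # rows of the dimension table
_by_business_key = {}        # business key -> surrogate key lookup


def load_customer(alt_key, first_name, last_name):
    """Return the surrogate key for a business key, inserting a row if it is new."""
    if alt_key in _by_business_key:
        return _by_business_key[alt_key]
    key = next(_next_key)
    dim_customer.append({
        "CustomerKey": key,         # surrogate key
        "CustomerAltKey": alt_key,  # business key from the source system
        "FirstName": first_name,
        "LastName": last_name,
    })
    _by_business_key[alt_key] = key
    return key


load_customer(1002, "Amy", "Alberts")
load_customer(1005, "Neil", "Black")
```

Loading the same business key a second time returns the existing surrogate key instead of adding a row.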

Attributes and Hierarchies
CustKey  CustAltKey  Name         Country  State  City       Phone    Gender
1        1002        Amy Alberts  Canada   BC     Vancouver  555 123  F
2        1005        Neil Black   USA      CA     Irvine     555 321  M
3        1006        Ye Xu        USA      NY     New York   555 222  M
Column roles shown on the slide: hierarchies, filters (slicers), details.
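
As a rough illustration of how these attribute roles are used at query time, the sketch below counts customers at different hierarchy levels and applies one attribute as a filter. The helper `count_by` is hypothetical, and treating Country, State as hierarchy levels and Gender as the filter is an assumption made for the example.

```python
from collections import Counter

# Rows copied from the table above (only the columns used here).
customers = [
    {"Name": "Amy Alberts", "Country": "Canada", "State": "BC", "City": "Vancouver", "Gender": "F"},
    {"Name": "Neil Black", "Country": "USA", "State": "CA", "City": "Irvine", "Gender": "M"},
    {"Name": "Ye Xu", "Country": "USA", "State": "NY", "City": "New York", "Gender": "M"},
]


def count_by(rows, *levels, **filters):
    """Count rows grouped by the given hierarchy levels, after applying filters."""
    selected = (r for r in rows if all(r[k] == v for k, v in filters.items()))
    return Counter(tuple(r[level] for level in levels) for r in selected)


by_country = count_by(customers, "Country")                    # top hierarchy level
by_state = count_by(customers, "Country", "State")             # drill down one level
males_by_country = count_by(customers, "Country", Gender="M")  # filter attribute
```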

Slowly Changing Dimensions
Type 1 (before)
CustKey  CustAltKey  Name         Phone
1        1002        Amy Alberts  555 123
Type 1 (after)
CustKey  CustAltKey  Name         Phone
1        1002        Amy Alberts  555 222
Type 2 (before)
CustKey  CustAltKey  Name         City       Current  Start     End
1        1002        Amy Alberts  Vancouver  Yes      1/1/2000
Type 2 (after)
CustKey  CustAltKey  Name         City       Current  Start     End
1        1002        Amy Alberts  Vancouver  No       1/1/2000  1/1/2012
4        1002        Amy Alberts  Toronto    Yes      1/1/2012
Type 3 (before)
CustKey  CustAltKey  Name         Cars
1        1002        Amy Alberts  0
Type 3 (after)
CustKey  CustAltKey  Name         Prior Cars  Current Cars
1        1002        Amy Alberts  0           1
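
The three behaviors can be sketched as follows. This is a simplified in-memory illustration, not an ETL implementation: the function names are hypothetical, rows are plain dictionaries, and the Type 3 example starts from a row that already has a "Current Cars" column rather than renaming "Cars" as the slide does.

```python
from datetime import date


def apply_type1(row, column, new_value):
    """Type 1: overwrite the value in place; no history is kept."""
    row[column] = new_value


def apply_type2(rows, alt_key, column, new_value, change_date, new_key):
    """Type 2: close the current row and append a new current row."""
    current = next(r for r in rows if r["CustAltKey"] == alt_key and r["Current"])
    current["Current"] = False
    current["End"] = change_date
    # Copy the closed row, then give it a new surrogate key and validity range.
    new_row = dict(current, CustKey=new_key, Current=True, Start=change_date, End=None)
    new_row[column] = new_value
    rows.append(new_row)


def apply_type3(row, prior_column, current_column, new_value):
    """Type 3: keep only the previous value next to the current one."""
    row[prior_column] = row[current_column]
    row[current_column] = new_value


phone_row = {"CustKey": 1, "CustAltKey": 1002, "Name": "Amy Alberts", "Phone": "555 123"}
apply_type1(phone_row, "Phone", "555 222")

city_rows = [{"CustKey": 1, "CustAltKey": 1002, "Name": "Amy Alberts", "City": "Vancouver",
              "Current": True, "Start": date(2000, 1, 1), "End": None}]
apply_type2(city_rows, 1002, "City", "Toronto", date(2012, 1, 1), new_key=4)

cars_row = {"CustKey": 1, "CustAltKey": 1002, "Name": "Amy Alberts", "Current Cars": 0}
apply_type3(cars_row, "Prior Cars", "Current Cars", 1)
```

Type 2 is the only variant that adds a row, which is why the same business key (1002) ends up with two surrogate keys (1 and 4).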

Time Dimension
• Granularity
• Ranges
• Multiple calendars
• Include a default date
DateKey   DateAltKey  MonthDay  Day   MonthNo  Month  Year
00000000  01-01-1753  NULL      NULL  NULL     NULL   NULL
20130101  01-01-2013  1         Tue   01       Jan    2013
20130102  01-02-2013  2         Wed   01       Jan    2013
20130103  01-03-2013  3         Thu   01       Jan    2013
20130104  01-04-2013  4         Fri   01       Jan    2013
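
A table like the one above is usually generated rather than typed. The following sketch, under the assumption of day-level granularity and a single calendar, builds the rows for a date range and puts the default row first. The function name is hypothetical, and DateKey is stored as an integer, so the slide's 00000000 key appears as 0.

```python
from datetime import date, timedelta

# Explicit English abbreviations so the output does not depend on the locale.
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def build_date_dimension(start, end):
    """Build one row per day from start to end, preceded by a default row."""
    rows = [{"DateKey": 0, "DateAltKey": date(1753, 1, 1), "MonthDay": None,
             "Day": None, "MonthNo": None, "Month": None, "Year": None}]
    current = start
    while current <= end:
        rows.append({
            "DateKey": current.year * 10000 + current.month * 100 + current.day,
            "DateAltKey": current,
            "MonthDay": current.day,
            "Day": DAY_NAMES[current.weekday()],
            "MonthNo": current.month,
            "Month": MONTH_NAMES[current.month - 1],
            "Year": current.year,
        })
        current += timedelta(days=1)
    return rows


date_rows = build_date_dimension(date(2013, 1, 1), date(2013, 1, 4))
```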

Self-Referencing Dimensions
EmployeeKey  EmployeeAltKey  EmployeeName  ManagerKey
1            1000            Manuel        NULL
2            1001            Julio         1
3            1002            Cesar         1
4            1003            Dora          2
Manuel
  Julio
    Dora
  Cesar
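
To show how the ManagerKey column encodes the tree, here is a small sketch that walks the hierarchy in both directions. The helper names are hypothetical, and the code assumes the data contains no cycles (no employee is, directly or indirectly, their own manager).

```python
employees = [
    {"EmployeeKey": 1, "EmployeeAltKey": 1000, "EmployeeName": "Manuel", "ManagerKey": None},
    {"EmployeeKey": 2, "EmployeeAltKey": 1001, "EmployeeName": "Julio", "ManagerKey": 1},
    {"EmployeeKey": 3, "EmployeeAltKey": 1002, "EmployeeName": "Cesar", "ManagerKey": 1},
    {"EmployeeKey": 4, "EmployeeAltKey": 1003, "EmployeeName": "Dora", "ManagerKey": 2},
]


def management_chain(rows, employee_key):
    """Return names from the given employee up to the top of the hierarchy."""
    by_key = {r["EmployeeKey"]: r for r in rows}
    chain = []
    key = employee_key
    while key is not None:          # a NULL ManagerKey marks the root
        row = by_key[key]
        chain.append(row["EmployeeName"])
        key = row["ManagerKey"]
    return chain


def direct_reports(rows, manager_key):
    """Return the names of employees whose ManagerKey points at manager_key."""
    return [r["EmployeeName"] for r in rows if r["ManagerKey"] == manager_key]
```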

Junk Dimensions
Groups small, related characteristics or dimensions into a single dimension to simplify the star schema and improve the data warehouse's response times.
JunkKey  OutOfStockFlag  FreeShippingFlag  CreditOrDebit
1        1               1                 Credit
2        1               1                 Debit
3        1               0                 Credit
4        1               0                 Debit
5        0               1                 Credit
6        0               1                 Debit
7        0               0                 Credit
8        0               0                 Debit
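
The eight rows above are every combination of the three attributes, so the table can be produced as a Cartesian product. A minimal sketch, with a hypothetical function name, might look like this:

```python
from itertools import product


def build_junk_dimension(**attribute_values):
    """Create one row per combination of the given attribute values."""
    names = list(attribute_values)
    rows = []
    for key, combo in enumerate(product(*attribute_values.values()), start=1):
        rows.append({"JunkKey": key, **dict(zip(names, combo))})
    return rows


junk = build_junk_dimension(
    OutOfStockFlag=[1, 0],
    FreeShippingFlag=[1, 0],
    CreditOrDebit=["Credit", "Debit"],
)
```

Pre-generating all combinations works when the attributes have few distinct values; with many values, an alternative is to insert only the combinations that actually occur in the source data.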

Fact Table Columns
OrderDateKey  ProductKey  CustomerKey  OrderNo  Qty  SalesAmount
20120101      25          120          1000     1    350.99
20120101      99          120          1000     2    6.98
20120101      25          178          1001     2    701.98
OrderDateKey, ProductKey, CustomerKey: dimension keys. OrderNo: degenerate dimension. Qty, SalesAmount: measures.
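
The column roles determine how the table is queried: measures are aggregated, while dimension keys and degenerate dimensions are used for grouping. The sketch below, with a hypothetical helper `total_by`, sums a measure by any grouping column using the rows from the table above.

```python
from collections import defaultdict
from decimal import Decimal

fact_sales = [
    {"OrderDateKey": 20120101, "ProductKey": 25, "CustomerKey": 120,
     "OrderNo": 1000, "Qty": 1, "SalesAmount": Decimal("350.99")},
    {"OrderDateKey": 20120101, "ProductKey": 99, "CustomerKey": 120,
     "OrderNo": 1000, "Qty": 2, "SalesAmount": Decimal("6.98")},
    {"OrderDateKey": 20120101, "ProductKey": 25, "CustomerKey": 178,
     "OrderNo": 1001, "Qty": 2, "SalesAmount": Decimal("701.98")},
]


def total_by(rows, column, measure):
    """Sum a measure grouped by a dimension key or degenerate dimension."""
    totals = defaultdict(Decimal)   # Decimal() starts each group at 0
    for row in rows:
        totals[row[column]] += row[measure]
    return dict(totals)


sales_by_order = total_by(fact_sales, "OrderNo", "SalesAmount")
qty_by_product = total_by(fact_sales, "ProductKey", "Qty")
```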

Types of Fact Tables
• Transaction-level fact table
• Periodic snapshot
• Accumulating snapshot
Transaction-level fact table:
OrderDateKey  ProductKey  CustomerKey  OrderNo  Qty  Cost    SalesAmount
20120101      25          120          1000     1    125.00  350.99
20120101      99          120          1000     2    2.50    6.98
20120101      25          178          1001     2    250.00  701.98
Periodic snapshot:
DateKey   ProductKey  OpeningStock  UnitsIn  UnitsOut  ClosingStock
20120101  25          25            1        3         23
20120101  99          120           0        2         118
Accumulating snapshot:
OrderNo  OrderDateKey  ShipDateKey  DeliveryDateKey
1000     20120101      20120102     20120105
1001     20120101      20120102     00000000
1002     20120102      00000000     00000000
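
Two details of these tables can be made concrete with a short sketch. In the periodic snapshot, the closing stock follows from the other three columns. In the accumulating snapshot, rows are updated after the initial load as each milestone is reached, with the default date key standing in for milestones that have not happened yet. The function names and the delivery date used for order 1001 are hypothetical.

```python
UNKNOWN_DATE_KEY = 0  # stands for the 00000000 default row of the date dimension


def closing_stock(opening, units_in, units_out):
    """Periodic snapshot: closing stock for the period."""
    return opening + units_in - units_out


# Accumulating snapshot rows keyed by OrderNo, copied from the table above.
orders = {
    1000: {"OrderDateKey": 20120101, "ShipDateKey": 20120102, "DeliveryDateKey": 20120105},
    1001: {"OrderDateKey": 20120101, "ShipDateKey": 20120102, "DeliveryDateKey": UNKNOWN_DATE_KEY},
    1002: {"OrderDateKey": 20120102, "ShipDateKey": UNKNOWN_DATE_KEY, "DeliveryDateKey": UNKNOWN_DATE_KEY},
}


def record_milestone(order_no, column, date_key):
    """Fill in a milestone date on an existing row once the event happens."""
    row = orders[order_no]
    if row[column] != UNKNOWN_DATE_KEY:
        raise ValueError(f"{column} is already set for order {order_no}")
    row[column] = date_key


# Hypothetical event: order 1001 is delivered on 2012-01-06.
record_milestone(1001, "DeliveryDateKey", 20120106)
```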

What Comes Next
• Self-service BI
• Non-normalized sources
• Cross-referencing non-homogeneous information
• Minimal project execution times

Azure SQL Data Warehouse
https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-overview-what-is/

Azure Data Lake
https://azure.microsoft.com/en-us/solutions/data-lake/

QUESTIONS / COMMENTS / SUGGESTIONS
JULIAN CASTIBLANCO P
Julian_castiblancop@Hotmail.com
@jcastiblancop
www.azurecloud.com.co
http://julycastiblanco.blogspot.com.co/
https://www.facebook.com/ITProsDC
http://www.meetup.com/ITPROS-DC/

Additional slides not listed above
4. http://conta.cc/29wAQXe
6. Today's recommendations
7. Inmon
8. Kimball http://www.kimballgroup.com/
24. Hadoop

Editor's notes

Use this topic to ensure that all students understand why the business key from the source system is not used as a unique key in dimension tables.
Emphasize that the categorization of attributes in this topic is simply used to help identify reasons why a data value would be included as a dimension attribute column. You do not need to apply any specific configuration to define an attribute as a slicer or a member of a hierarchy. Point out that the levels of the hierarchy are all stored within a single dimension table, resulting in duplication. This is preferable to normalizing the data to create a table for each hierarchy in a snowflake schema. OLTP database developers might find this preference for duplication over normalization unintuitive, but remind them that dimension data is generally denormalized from multiple tables before being loaded, and does not experience the same level of transactional updates as would occur in an OLTP database. Therefore, the performance benefits of storing the data in a single table generally outweigh the reduced duplication benefits of normalizing the data.
The slide shows a before and after representation of the changes in Type 1, Type 2 and Type 3 tables.
Discuss the issues in the bulleted list in the student content. In some cases, you might choose to include a column for the parent alternate key as well as the parent key, because this can be useful in some load techniques. Some techniques for loading self-referencing dimension tables are discussed in Module 4: Designing an ETL Solution.
As an alternative to a junk dimension, fact-specific attributes can be used to create degenerate dimensions in the fact table. This approach is discussed in the next lesson.
Point out that degenerate dimension columns provide the same capability as a junk dimension table. In a scenario where only one fact table requires the additional miscellaneous attributes for analysis and reporting, it is generally more efficient to include them as degenerate dimension columns. Conversely, if the additional attributes are relevant for multiple fact tables, a junk dimension is probably a better choice. Discuss the note about fact table primary keys in the student manual. Students with a strong background in relational database design might feel uncomfortable about not defining a primary key for every table. If, however, there is no need to uniquely identify individual fact rows, and the ETL process can be relied on to eliminate accidental duplicate entries, defining a primary key adds unnecessary overhead to the table definition and generates an index, which can negatively affect the performance of data loads. Similarly, note that declaring foreign-key constraints on dimension-key columns in a fact table is not necessary to enforce referential integrity in most data warehouses, and can negatively impact load performance. You can declare them, and then drop and recreate them during each load, but this creates its own overhead and adds little value if the ETL process is correctly implemented. The query optimizer can use foreign-key constraints to identify the fact table in a star join query but, in their absence, selects the largest table, which is usually correct.
Discuss the importance of including a row for “Unknown” or “None” in the time dimension table when using accumulating snapshot fact tables. Point out that accumulating snapshot fact tables must be updated after the initial load. This requirement can affect the physical design of the table, especially if partitions or column store indexes are used. These considerations are discussed in the next lesson and in Module 4: Designing an ETL Solution.
