The video of this presentation is on my blog (www.alankoo.com).
An introduction to what Big Data is and is not, and to Microsoft's strategy built on HDInsight, a distribution based 100% on Apache Hadoop that opens up new scenarios in the world of Business Intelligence. This webcast was originally presented at the "Maratón de Business Intelligence" event (Intermezo and Microsoft TechNet).
Big Data: The What and the How
1. Big Data: The What and the How
Alan Koo
Senior Consultant | Nagnoi, Inc.
www.alankoo.com | @alan_koo
2. About me
• Senior Consultant at Nagnoi, Inc.
• 13+ years with SQL Server
• 8+ years in BI & OLAP
• Microsoft certifications in SQL Server, Business Intelligence, and .NET
• MCT Regional Lead – Puerto Rico
• MCT since 2004 for Business Intelligence / SQL Server / .NET
• Member of the Microsoft BI Advisors group
• Member of the SSRS Insiders Group
• Microsoft MVP (2008 – 2011)
• Co-founder of Puerto Rico PASS
• Blogger: www.alankoo.com
Alan Koo | www.alankoo.com
3. Agenda
What is Big Data?
Common sources of Big Data
Common Big Data scenarios
HDInsight: Windows Azure + Hadoop
10. The transition to Big Data
[Diagram. Labels: Volume, Velocity; small, big; pull, push; fk/pk, k/v; SQL Server, PDW, HDInsight]
11. What is Big Data?
[Figure. Volume grows from Gigabytes (10^9) through Terabytes (10^12) and Petabytes (10^15) to Exabytes (10^18), along with Velocity, Variety, and Variability.]
Stages shown: ERP / CRM, WEB, WEB 2.0, Internet of things.
Data sources shown: Social, Sentiment, Clickstream, Mobile, Blogs / Wikis, Sensors / RFID / Devices, Audio / Video, Log files, Advertising, eCommerce, Collaboration, Digital marketing, Search marketing, Payments, Payroll, Inventory, Contacts, Order tracking, Sales management, Spatial coordinates & GPS, Data Market feeds, eGov feeds, Web logs, Weather, Recommendations, Text / Images.
Storage cost per GB: 1980: $190,000; 1990: $9,000; 2000: $15; 2010: $0.07.
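To put the storage cost per GB figures from this slide in perspective, the short Python sketch below computes the overall fold reduction and the implied compound annual rate of change. The four price points are taken from the slide; the function names are hypothetical and the calculation is only a back-of-the-envelope illustration.

```python
# Cost of storage per GB in US dollars, as listed on the slide.
COST_PER_GB = {1980: 190_000.0, 1990: 9_000.0, 2000: 15.0, 2010: 0.07}


def fold_reduction(start_year: int, end_year: int) -> float:
    """How many times cheaper one GB became between the two years."""
    return COST_PER_GB[start_year] / COST_PER_GB[end_year]


def annual_change(start_year: int, end_year: int) -> float:
    """Compound annual rate of change (a negative value means cheaper)."""
    years = end_year - start_year
    ratio = COST_PER_GB[end_year] / COST_PER_GB[start_year]
    return ratio ** (1.0 / years) - 1.0


if __name__ == "__main__":
    print(f"1980 to 2010: {fold_reduction(1980, 2010):,.0f}x cheaper")
    print(f"Average yearly change: {annual_change(1980, 2010):.1%}")
```

Over the thirty years shown, the price per GB falls by a factor of a few million, which is the economic shift the talk credits for making big data processing practical.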
12. Big Data
“Big data is a term describing the storage and analysis of large and or complex data sets
using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.”
http://www.technologyreview.com/view/519851/the-big-data-conundrum-how-to-define-it/
13. Big Data is...
Not the size of the data
Not cool tools like Hadoop and R
A new paradigm for collecting and using data differently.
16. Things customers may be saying
• We need to parallelize data operations, but it is too costly and complicated…
• The business cannot access all the relevant data; we need external data…
• We cannot match the customer master data during live interactions…
• We cannot force everything into a star schema
• Our BI reports and charts tell us nothing we don't already know
• We are missing the ETL window; the data we need does not arrive on time…
• We cannot predict with confidence if we cannot explore the data and develop our own models
19. If Chris Paul passes the ball to a teammate within five feet of the basket, there’s an 89 percent chance it will result in a score
http://www.adweek.com/news/technology/nba-making-big-data-play-153264
20. Common data sources
Progressive: http://articles.chicagotribune.com/2013-09-15/classified/ct-biz-0915--telematics-insure-20130915_1_insurance-companies-insurance-telematics-progressive-snapshot
23. Hadoop
• A collection of open source Apache projects for storing and processing big data (large volumes of unstructured or semi-structured data)
• Has evolved over the last 7+ years to support some of the largest websites and products in terms of data
• The base ("kernel") of HDInsight
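MapReduce, one of the techniques named in the definition on slide 12, is the processing model most closely associated with Hadoop. As a rough illustration only, the following single-process Python sketch walks through the map, shuffle, and reduce steps for a word count. It does not use any Hadoop API; all function names are hypothetical, and a real cluster would run the map and reduce steps in parallel on the nodes that hold the data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data is not big", "data is data"]
    print(word_count(sample))
```

Because each mapper only needs its own lines and each reducer only needs the values for its own keys, the same logic can be spread across many machines.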
28. What is HDInsight?
• Enterprise-grade data platform
• Built on Hadoop in partnership with Hortonworks
• Currently available as a "preview" service on Windows Azure
29. Windows Azure HDInsight Service
[Diagram. Labels: Job submission (Hive query, etc.); Query & Metadata; Data Movement; Workflow; Monitoring; Hadoop Filesystem Interface; Data upload/download]
34. HDFS on Azure: A Tale of Two File Systems
[Diagram. The HDFS API sits over two back ends: DFS (1 Data Node per Worker Role) and Compute Cluster, with a Name Node and Data Nodes; and Azure Storage (ASV), that is, Azure Blob Storage, with Front end, Partition Layer, and Stream Layer]
35. Azure Storage (ASV)
• Default file system for HDInsight
• Provides shareable, persistent, highly scalable, and highly available storage (Azure Blob Store)
• Azure storage by itself does not provide compute
• Fast access from the compute nodes to data in the same data center
• Multiple file systems, reachable via:
asv[s]://<container>@<account>.blob.core.windows.net/<path>
• Requires the storage key in core-site.xml:
<property>
<name>fs.azure.account.key.accountname</name>
<value>enterthekeyvaluehere</value>
</property>
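The addressing pattern and the core-site.xml entry on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of any HDInsight SDK: the helper names, the container `logs`, and the account `myaccount` are hypothetical, the URI is assumed to use `://` after the scheme, and the property name simply follows the `fs.azure.account.key.accountname` pattern shown on the slide (the exact name may differ by release).

```python
import xml.etree.ElementTree as ET


def asv_uri(container: str, account: str, path: str, secure: bool = False) -> str:
    """Build an ASV path following the pattern shown on the slide."""
    scheme = "asvs" if secure else "asv"
    clean_path = path.lstrip("/")
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{clean_path}"


def storage_key_property(account: str, key: str) -> str:
    """Render the core-site.xml <property> element that holds the storage key."""
    prop = ET.Element("property")
    # Property name follows the slide's pattern; treat it as an assumption.
    ET.SubElement(prop, "name").text = f"fs.azure.account.key.{account}"
    ET.SubElement(prop, "value").text = key
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    print(asv_uri("logs", "myaccount", "/2013/09/clicks.txt"))
    print(storage_key_property("myaccount", "enterthekeyvaluehere"))
```

Keeping the key in configuration rather than in each path is what lets jobs refer to data by container and account alone.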
36. Consuming results from HDInsight
Destination | Tool / Library | Requires an Active HDInsight Cluster
SQL Server, Azure SQL DB | Sqoop (Hadoop ecosystem project) | Yes
Excel | Codename “Data Explorer” | No
Another Blob Storage Account | Azure Blob Storage REST APIs (Copy Blob, etc.) | No
SQL Server Analysis Services | Hive ODBC Driver | Yes
Existing BI Apps | Hive ODBC Driver (assumes app supports ODBC connections to data sources) | Yes
38. Tomorrow's DW/BI environment
[Diagram. Labels: ETL; Data Warehouse; Business critical; OLAP; Reporting]
39. Microsoft's Big Data solution
40. In summary
• HDInsight is an enterprise-grade, Hadoop-based platform for "big data" storage
• Azure Blob Storage + HDInsight == simple, cloud-based "big data" storage and processing, available to try today
• Consuming HDInsight results in familiar tools and applications (Excel, etc.) is simple with Power Query, Azure Blob APIs, Sqoop, ODBC, etc.
42. Resources / References
http://brianwmitchell.com/
bit.ly/loKoMN
– Do You Have Big Data? (Most Likely!) bit.ly/1awKcqE
– Introduction To Windows Azure HDInsight Service bit.ly/1awL923
– Data Management in Microsoft HDInsight: How to Move and Store
Your Data bit.ly/16jqv9M
– Make Your Apps Smarter with Azure HDInsight bit.ly/1b1mtQN
http://nuget.org/packages?q=hadoop
http://hadoopsdk.codeplex.com
Slide Objectives: Set up the problem: devices and social networks are causing an explosion of data. 1.8 ZB were created last year, and in 2 years 7.8 ZB of data will be created each year.
Speaking Points: New devices and usage scenarios are creating more data than ever. Cheaper storage and compute make it possible to process some of that data, which is why "big data" tools and a big data industry have emerged.
Notes: These are the trends triggering the big data revolution. Most of us are already familiar with them, but we need to look at them again from new perspectives. Almost everyone here has one or more mobile devices; the world currently has 5.5 billion devices, reaching 70% of the world's population. Social networks such as Facebook and Twitter have more than 2 billion users and are growing fast; we will reach 7.2 zettabytes of information created per year by 2015. Beyond the data humans create, the next growth area is sensor networks, or the "internet of things": we will have more than 10 billion networked sensors in the very near future. At the same time, two other trends are moving in the opposite direction: the costs of compute and storage have dropped rapidly. These two trends are also helping the big data industry grow. When explosive data growth meets a rapid decrease in storage prices, there is suddenly an opportunity to invest in big data. In return we get not only information and insight, but also increased productivity and competitiveness. Things we could not do before suddenly become feasible.
Slide Objectives: Types of data and the characteristics of big data.
Transition: Big data is not simply about volume, but also about how fast the data moves and its unstructured nature.
Speaking Points: Volume: we created 1.8 zettabytes in the past year, and that will double every 1.5 years. Data velocity adds to the difficulty: SLAs become much harder to meet when data arrives constantly, as with social networks and the internet of things. We cannot simply stop data sources from producing data while we fix our systems.
Notes: Variety = different types of data; variability = data structure changes over time. Gartner's Merv Adrian, in a Q1 2011 Teradata Magazine article: “Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.” McKinsey Global Institute, May 2011: “Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.”
Telematics: Progressive's in-car telematics device lets insured drivers earn better discounts based on their driving habits.
Text: JSON (semi-structured) into relational. Dell and Twitter: pattern recognition, kinds of complaints, detecting problems before they happen.
Healthcare: fraud detection, patients, doctor notes. Legal cases: search in emails.
Place and time: Foursquare, Facebook, Fitbit, RunKeeper. Geospatial information. Facebook recommends friends using geospatial data (places that I go). Biking: find who else bikes and recommend them.
RFID: (today) tags in warehouses, on pallets. Each item in the grocery store, and where people pick items up (at the door, in the aisle, at the checkout, etc.).
Smart grid: smart meters produce a lot of data, making it possible to bill specifically for what you use and to know more about when the service is used.
Sensors everywhere (cars, airplanes), feeding predictive analytics to prevent future failures.
Xbox and gaming: what are you using, and what is too hard or too easy? The game can then be made harder or easier.
Retail: vending machines, inventory, stock.
Law enforcement: bracelets.
Switching phone companies: how many different people a customer interacts with; the company does not want to lose her. A lot of interactions with their customers.
Organization:
Similar items: similar web pages.
Collaborative filtering: Amazon.
Data stream mining: summarize it, or evaluate a set? "In the last 30 tweets, this is what people are saying." Images: case study with the NY police department (Manhattan).
Item sets: diapers and beer. Someone who buys a laptop is likely to buy a mouse or monitor. Place items together in the store: wine and cheese. Plagiarism: items (documents). Related web pages (based on words). Customers are more positive about this and more negative about that.
Clustering: cluster items. SSAS, Excel Data Mining.
Recommendation systems: Netflix (movies).
Social network: unsuccessful communication between team members. They
Speaking Points: MapReduce is about minimizing the movement of data inside your cluster. The job tracker knows where all the data blocks are and sends the operation code to the node that contains the data.
Slide Objectives: Understand the HDInsight ecosystem.
Speaking Points: The biggest buzzword in Big Data right now is Hadoop. It can mean many things, but it always includes HDFS and MapReduce.
HDInsight color key: Red = in product now; Blue = planned for product; Green = ecosystem can connect now; Purple = samples available; Orange = ecosystem planned.
Flume and HBase are not available in the first release of the HDInsight Service. As of 3/15, there is no on-premises solution, so AD integration is not yet available. System Center integration will come later as well. The green boxes are ecosystem packages that are not included in the service but should work out of the box once downloaded.
Slide Objectives: Provide one layer to access both the attached/local storage on each node and the remote Windows Azure Blob storage, which is the default.
Speaking Points: One interface to rule both DFS and Azure Blob storage.
Blob storage: Front End: security/auth and scaled-out request handler. Partition Layer: object layer; maps objects such as Tables, Blobs, and Queues to streams (cached in Front End), CC. Stream Layer: 3-node HA, scale-out stream store. See the Windows Azure Storage paper for details.
In some ways ASV changes things again: we are now moving data to the compute, since the data is remote. Blob storage lets you persist your data even when you tear down your cluster.
Slide Objectives: Understand the details of ASV.
Speaking Points: You will need to create an Azure storage account, and you will need your account name and key. Create the cluster close to where your data is (for storage in the West, create the cluster in the West data center).
Slide Objectives: Walk from the bottom layer up to discuss the Microsoft big data solution.
Speaking Points: BI platform: SQL Server Analysis Services and Reporting Services. Self-service BI: Power View, PowerPivot, predictive analysis, and embedded BI. Taking in unstructured and structured data sources through Hadoop or PDW.
Slide Objectives: Vision slide.
Speaking Points: Broaden access to Hadoop on the Windows platform. Enterprise ready through AD and System Center (to come). BI integration and self-service BI.