This document introduces Apache Spark, a fast and expressive cluster computing system. Spark is faster than Hadoop because it keeps data in memory for iterative queries. Spark is compatible with Hadoop and can read and write data in any Hadoop-supported system, such as HDFS. Spark uses Resilient Distributed Datasets (RDDs), which support parallel transformations over distributed collections of data.
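As a minimal illustration of the RDD model the deck covers, here is a short PySpark sketch (an editorial example, not from the slides; it assumes a SparkContext named sc, as in the shell examples later in the deck):

data = sc.parallelize(range(1, 1001))         # distribute a local collection across the cluster
squares = data.map(lambda x: x * x)           # transformation: lazy and parallel
evens = squares.filter(lambda x: x % 2 == 0)  # transformation: lazy and parallel
evens.count()                                 # action: triggers the job, returns 500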
6. What is Spark?
▪ Not a modified version of Hadoop
▪ A standalone, fast, MapReduce-style engine
▪ In-memory storage for very fast iterative queries
▪ Up to 40x faster than Hadoop
▪ Compatible with Hadoop's storage API
▪ Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
7. What is Spark?
A fast and expressive cluster computing system, compatible with Apache Hadoop.
Efficient:
• In-memory storage
Usable:
• APIs in Java, Scala, and Python
• Interactive shell
8. Evolution of Spark
• 2008 - Yahoo!'s Hadoop team begins collaborating with Berkeley's AMP/RAD Lab
• 2009 - Spark example built for Nexus -> Mesos
• 2011 - "Spark is 2 years ahead of anything at Google"
• 2012 - Yahoo! works with Spark / Shark
• Today - many success stories
10. Spark and Hadoop: what has changed
• Hardware has improved since Hadoop began:
• Much larger RAM, faster networks (10 Gb+)
• Disk bandwidth has not kept pace
• MapReduce is awkward for many workloads
11. Spark, a "lingua franca"?
• Supports many development styles
• SQL, streaming, graph, in-memory, MapReduce
• Write a "UDF" once and use it in every context
• Small, simple, elegant API
• Easy to learn and use; expressive and extensible
• Retains the advantages of MapReduce (fault tolerance, ...)
12. Concepts
Write programs in terms of transformations on distributed datasets.
Resilient Distributed Datasets (RDDs):
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
Operations:
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
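To make the transformation/action split concrete, a hedged PySpark sketch (assuming a SparkContext sc; the data is invented for illustration):

words = sc.parallelize(["spark", "hadoop", "spark", "mesos"])
by_letter = words.groupBy(lambda w: w[0])  # transformation: builds a new RDD, nothing executes yet
by_letter.count()                          # action: runs the job, returns 3 (groups s, h, m)

Transformations only describe the computation; the cluster does no work until an action asks for a result.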
13. Data Sharing in MapReduce
[Diagram: each iteration reads its input from HDFS and writes its output back to HDFS; each of queries 1-3 re-reads the input from HDFS to produce its result.]
Slow due to replication, serialization, and disk I/O.
14. Data Sharing in Spark
[Diagram: the input is processed once into distributed memory; iterations and queries 1-3 then read from memory rather than from HDFS.]
10-100× faster than network and disk.
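The pattern in the second diagram looks roughly like this in PySpark (a sketch with a hypothetical HDFS path, assuming a SparkContext sc):

data = sc.textFile("hdfs://namenode:9000/path/input").cache()  # loaded once, kept in memory
for pattern in ["ERROR", "WARN", "INFO"]:
    # bind pattern as a default argument so each lambda captures its own value
    n = data.filter(lambda line, p=pattern: p in line).count()  # served from RAM after the first pass
    print(pattern, n)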
15. Sort Benchmark
Daytona Gray sort benchmark: sorting 100 TB of data (1 trillion records)
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

                          Hadoop MR Record (2013)          Spark Record (2014)
Data size                 102.5 TB                         100 TB
Elapsed time              72 mins                          23 mins
# Nodes                   2100                             206
# Cores                   50400 physical                   6592 virtualized
Cluster disk throughput   3150 GB/s (est.)                 618 GB/s
Network                   dedicated data center, 10Gbps    virtualized (EC2), 10Gbps
Sort rate                 1.42 TB/min                      4.27 TB/min
Sort rate/node            0.67 GB/min                      20.7 GB/min

Spark: 3× faster with 1/10 the nodes.
24. Programming in Spark
▪ Key idea: Resilient Distributed Datasets (RDDs)
▪ Distributed collections of objects that can be cached in memory across cluster nodes
▪ Manipulated through a variety of parallel operators
▪ Automatically rebuilt on failure
▪ Interface
▪ Clean language-integrated API in Scala
▪ Can be used interactively from the Scala shell
25. Working with RDDs
Transformations build new RDDs; actions compute and return a value.

textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()   # => 74
linesWithSpark.first()   # => "# Apache Spark"
26. RDDs: Distributed
● The data does not have to live on a single machine
● The data is split into partitions
○ When needed, each worker can operate on its own partition of the data at the same time
[Diagram: a driver and three workers, each worker holding one block of data.]
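A small PySpark sketch of partitioning (editorial example; assumes a SparkContext sc):

rdd = sc.parallelize(range(100), numSlices=4)      # ask for 4 partitions explicitly
rdd.getNumPartitions()                             # => 4
partial = rdd.mapPartitions(lambda it: [sum(it)])  # one partial sum per partition, in parallel
partial.collect()                                  # => four partial sums, e.g. [300, 925, 1550, 2175]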
27. Example: Log Mining
Load error messages into memory, then interactively search for patterns.

val lines = spark.textFile("hdfs://...")

[Diagram: a driver program coordinating three workers.]
29. Example: Log Mining
Load error messages into memory, then interactively search for patterns.

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
30. Example: Log Mining
Load error messages into memory, then interactively search for patterns.

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
31. Example: Log Mining
Load error messages into memory, then interactively search for patterns.

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))

messages.filter(_.contains("mysql")).count()
32. Example: Log Mining
Load error messages into memory, then interactively search for patterns.

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
messages.cache()   // cache the RDD

messages.filter(_.contains("mysql")).count()
33. Example: Log Mining
Load error messages into memory, then interactively search for patterns.

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
messages.cache()

messages.filter(_.contains("mysql")).count()   // action
34. Example: Log Mining
(Same code as slide 33.)
[Diagram: the input file is split into Block 1, Block 2, and Block 3, one block per worker.]
35. Example: Log Mining
(Same code as slide 33.)
[Diagram: on count(), the driver ships tasks to each worker.]
36. Example: Log Mining
(Same code as slide 33.)
[Diagram: each worker reads its HDFS block.]
37. Example: Log Mining
(Same code as slide 33.)
[Diagram: each worker processes its block and caches the data (Cache 1-3).]
38. Example: Log Mining
(Same code as slide 33.)
[Diagram: each worker sends its results back to the driver.]
39. Example: Log Mining
With messages cached, a second query runs against memory:

messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()
40. Example: Log Mining
(Second query, continued.)
[Diagram: the driver ships tasks for the new query to the workers.]
41. Example: Log Mining
(Second query, continued.)
[Diagram: each worker processes directly from its cache; no HDFS reads.]
42. Example: Log Mining
(Second query, continued.)
[Diagram: results return to the driver.]
43. Example: Log Mining
Caching your data means faster results:
• 1 TB of log data
• 5-7 s from cache vs. 170 s from disk
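For readers following along in Python, a hedged PySpark equivalent of the Scala pipeline above (same hypothetical "hdfs://..." path; assumes a SparkContext sc):

lines = sc.textFile("hdfs://...")
errors = lines.filter(lambda line: line.startswith("ERROR"))
messages = errors.map(lambda line: line.split("\t")[2])
messages.cache()                                  # keep parsed messages in RAM across queries
messages.filter(lambda m: "mysql" in m).count()   # first action: reads HDFS, fills the cache
messages.filter(lambda m: "php" in m).count()     # later actions are served from the cache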
44. Interactive Shell
• The fastest way to learn Spark
• Available in Python and Scala
• Runs as an application on an existing Spark cluster...
46. sparklyr: an R interface for Apache Spark
Source: http://spark.rstudio.com/
• Install via devtools
• Loads data into Spark DataFrames from local R data frames, Hive tables, CSV, and JSON
• Connects to local Spark instances and to remote Spark clusters
47. dplyr and ML in sparklyr
• Includes 3 families of functions for machine learning:
• ml_*: machine learning algorithms for analyzing data, from the spark.ml package
• K-Means, GLM, LR, Survival Regression, DT, RF, GBT, PCA, Naive Bayes, Multilayer Perceptron
• ft_*: feature transformers for manipulating individual features
• sdf_*: functions for manipulating SparkDataFrames
• Provides a dplyr back end (%>%) for data manipulation, analysis, and visualization
48. R Server: scale-out R for the enterprise
▪ 100% compatible with open source R
▪ Can run R functions in parallel
▪ Ideal for parameter sweeps, simulation, and scoring
▪ Distributed, scalable "rx" functions in the "RevoScaleR" package:
▪ Transformations: rxDataStep()
▪ Statistics: rxSummary(), rxQuantile(), rxChiSquaredTest(), rxCrossTabs()…
▪ Algorithms: rxLinMod(), rxLogit(), rxKmeans(), rxBTrees(), rxDForest()…
▪ Parallelism: rxSetComputeContext()
51. R Server on Hadoop: Architecture
[Diagram: Microsoft R Server runs a master R process on the edge node; worker R processes run on the data nodes under Apache YARN with Spark; data lives in distributed storage.]
57. Creating RDDs
# Load text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
# Use existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(keyClass, valClass, inputFmt, conf)
# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])
58. Basic Transformations
> nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
> squares = nums.map(lambda x: x*x)   # {1, 4, 9}
# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))    # => {0, 0, 1, 0, 1, 2}
# (range(x) is the sequence of numbers 0, 1, …, x-1)
# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)   # {4}
59. Basic Actions
> nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
> nums.collect()   # => [1, 2, 3]
# Count number of elements
> nums.count()   # => 3
# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)   # => 6
# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")
# Return first K elements
> nums.take(2)   # => [1, 2]
60. Working with Key-Value Pairs
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs.

Python: pair = (a, b)
        pair[0]  # => a
        pair[1]  # => b
Scala:  val pair = (a, b)
        pair._1  // => a
        pair._2  // => b
Java:   Tuple2 pair = new Tuple2(a, b);
        pair._1  // => a
        pair._2  // => b
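The classic use of these pair operations is word count; a hedged PySpark sketch (hypothetical input path, assuming a SparkContext sc):

lines = sc.textFile("hdfs://...")
counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda w: (w, 1))                # build (key, value) pairs
               .reduceByKey(lambda x, y: x + y))     # distributed reduce per key
counts.take(5)                                       # a few (word, count) pairs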
64. Setting the Level of Parallelism
All pair RDD operations take an optional second parameter for the number of tasks:

> words.reduceByKey(lambda x, y: x + y, 5)
> words.groupByKey(5)
> visits.join(pageViews, 5)
66. Additional RDD Operators
• map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
• reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
• sample, take, first, partitionBy, mapWith, pipe, save, ...
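To make a few of the less common operators concrete, a short PySpark sketch (editorial example; assumes a SparkContext sc; output order may vary):

a = sc.parallelize([("a", 1), ("b", 2)])
b = sc.parallelize([("a", 3), ("c", 4)])
a.union(b).collect()          # [('a', 1), ('b', 2), ('a', 3), ('c', 4)]
a.join(b).collect()           # [('a', (1, 3))] -- inner join on keys
a.leftOuterJoin(b).collect()  # [('a', (1, 3)), ('b', (2, None))]
a.cogroup(b).mapValues(lambda v: (sorted(v[0]), sorted(v[1]))).collect()
                              # [('a', ([1], [3])), ('b', ([2], [])), ('c', ([], [4]))]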