Analytics to the masses by Jose Luis Lopez at Big Data Spain 2014


In this talk, we discuss our approach to bringing large-scale deep analytics to the masses. R is an extremely popular numerical computing environment, but scientific data processing frequently hits its memory limits. On the other hand, systems for executing data-intensive tasks, such as Hadoop or Stratosphere, are not popular among R users because writing programs using these paradigms is cumbersome.
Abstract: http://www.bigdataspain.org/2014/conference/analytics-to-the-masses
Video: https://www.youtube.com/watch?v=o5ulDWr7zWg


BIG DATA ANALYTICS TO THE MASSES
JOSE LUIS LÓPEZ PINO
DATA ENGINEER GETYOURGUIDE

Big Data Analytics
to the masses
Why it has failed and how we can fix it
Jose Luis Lopez Pino

Who am I?
BI Consultant
Large-Scale & Distributed
Founding
Data Engineer

Big Data is like Tourism
It seems easy to do
But if you aren’t an expert,
you can’t make the most of it

Struggle to analyze Big Data
Harlan Harris, Sean Murphy, and Marck Vaisman. Analyzing the Analyzers: An Introspective Survey of Data
Scientists and Their Work. O’Reilly Media, Inc., 2013
Also: Sean Kandel, Andreas Paepcke, Joseph M Hellerstein, and Jeffrey Heer. Enterprise data analysis and
visualization: An interview study. Visualization and Computer Graphics, IEEE Transactions

Tools
Volker Markl. Breaking the chains: On declarative data analysis and data independence in the big data era.
Proceedings of the VLDB Endowment, 7(13), 2014

Tools (October 2014)
Original: Volker Markl. Breaking the chains: On declarative data analysis and data independence in the
big data era. Proceedings of the VLDB Endowment, 7(13), 2014

We need libraries...
Libraries!
Query languages
Write your own
MR/RDD/Transformations

Say it with memes!
When you do
deep analytics in small data
using R and CRAN packages
When you do
deep analytics in BIG data
using R and CRAN packages

When you try to program it
using MapReduce
When you try to program it
using Apache Spark /
Apache Flink
When you try to use a library
scalable to large data sets

Can’t we do it better?
- Make it similar to normal R
programs.
- Hide complexity.
- Make file manipulation easier.
- Part of the computation in the
cluster and part of the
computation in the client.

Without writing significantly different code
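
The goal above (same user code, with the heavy lifting hidden behind a backend) can be illustrated with a minimal sketch. It is written in Python rather than R and is not the system presented in the talk. All names (`local_backend`, `partitioned_backend`, `mean`) are hypothetical, and a thread pool merely stands in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def local_backend(data, map_fn, reduce_fn):
    """Hypothetical backend: run the whole computation in the client process."""
    return reduce(reduce_fn, (map_fn(x) for x in data))


def partitioned_backend(data, map_fn, reduce_fn, partitions=4):
    """Hypothetical stand-in for a cluster (assumption: threads play the workers).

    Each partition is reduced independently, then the partial results are
    combined on the client side.
    """
    chunks = [data[i::partitions] for i in range(partitions)]
    chunks = [chunk for chunk in chunks if chunk]  # drop empty partitions

    def run_chunk(chunk):
        return reduce(reduce_fn, (map_fn(x) for x in chunk))

    with ThreadPoolExecutor(max_workers=partitions) as pool:
        partials = list(pool.map(run_chunk, chunks))
    return reduce(reduce_fn, partials)


def mean(data, backend=local_backend):
    """User-facing function: the call looks the same whichever backend runs it."""
    total, count = backend(
        data,
        lambda x: (x, 1),                            # map: value -> (sum, count)
        lambda a, b: (a[0] + b[0], a[1] + b[1]),     # reduce: add sums and counts
    )
    return total / count


values = list(range(1, 101))
small = mean(values)                               # computed in the client
big = mean(values, backend=partitioned_backend)    # computed on partitions
```

Only the `backend` argument changes between the two calls; the analysis code itself stays the same, which is the property the slide asks for.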

Competitive with, or even faster than, native R code on small data

Some relevant findings
- Transmission time was not significant.
- Stratosphere/Flink was competitive in highly
iterative programs.
- We were not able to do it while keeping the code
100% the same.
- Ensemble scenarios are the most exciting
ones.

4 Takeaways from this talk
- We still need to bring Big Data to the right
people in the right place.
- We need comprehensive libraries.
- We need to move data back and forth.
- Use a syntax that the users are familiar with.

That’s all!
- Have you found this talk interesting?
- Follow me: @jllopezpino
- Interested in a job as SEM Data Analyst
(Berlin)?
- Ask me for the details:
- Are you interested in Data + Energy?
- Keep in touch:
