Abstract: Traditional machine learning and feature engineering algorithms are not efficient enough to extract the complex and nonlinear patterns that are hallmarks of big data. Deep learning, on the other hand, can translate the scale and complexity of that data into solutions for problems like modeling molecular interactions in drug design, searching for subatomic particles, and automatically parsing microscopic images. Co-locating a data processing pipeline with a deep learning framework makes data exploration and algorithm/model evolution much simpler, while streamlining data governance and lineage tracking into a more focused effort. In this talk, we will compare the deep learning frameworks that run on Spark in distributed mode, their ease of integration with the Hadoop ecosystem, and their relative feature parity.
Apache Hadoop and its related ecosystem have come to play a significant role in “Big Data Analytics”. They provide a rich and wide set of choices for handling data formats, source variation, fast-moving and evolving streaming data, security and trust, distributed and noisy sources, supported algorithms, high dimensionality, and cluster scalability.
It goes without saying that co-locating a data processing pipeline with a deep learning framework makes data exploration and algorithm/model evolution much simpler, and at the same time makes data governance and lineage tracking a simpler effort.
Let’s take a quick segue.
BigDL is a distributed deep learning library for Apache Spark*. Using BigDL, you can write deep learning applications as Scala or Python* programs and take advantage of the power of scalable Spark clusters.
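As a minimal sketch of what that looks like (assuming BigDL's Python API and hypothetical toy data; parameter values here are illustrative, not recommendations), a small classifier can be defined and trained directly on a Spark RDD:

```python
import numpy as np
from pyspark import SparkContext
from bigdl.util.common import init_engine, create_spark_conf, Sample
from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

# Create a Spark context with BigDL's configuration, then initialize its engine.
sc = SparkContext(conf=create_spark_conf().setAppName("bigdl-sketch"))
init_engine()

# Hypothetical toy data: (features, label) pairs wrapped as BigDL Samples.
# Note that BigDL class labels are 1-based.
raw = sc.parallelize([(np.random.rand(4).astype("float32"), float(i % 2 + 1))
                      for i in range(1024)])
train_rdd = raw.map(lambda fl: Sample.from_ndarray(fl[0], np.array(fl[1])))

# A small feed-forward classifier.
model = Sequential()
model.add(Linear(4, 8))
model.add(ReLU())
model.add(Linear(8, 2))
model.add(LogSoftMax())

# Distributed training: the Optimizer runs synchronous mini-batch SGD
# across the Spark cluster.
optimizer = Optimizer(model=model,
                      training_rdd=train_rdd,
                      criterion=ClassNLLCriterion(),
                      optim_method=SGD(learningrate=0.01),
                      end_trigger=MaxEpoch(2),
                      batch_size=64)  # must be a multiple of executors * cores
trained_model = optimizer.optimize()
```

The same model could be written as a Scala program against the equivalent BigDL Scala API; the point is that the whole job is just a Spark application.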
You may want to write your deep learning programs using BigDL if, for example:
• You want to use existing Hadoop/Spark clusters to run your deep learning applications, and then easily share those clusters with other workloads (e.g., extract/transform/load, data warehousing, feature engineering, classical machine learning, graph analytics); see the sketch after this list. The undesirable alternative is to introduce yet another distributed framework alongside Spark just to implement deep learning algorithms.
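To make the colocation point concrete, here is a hedged sketch (the CSV path and column names are hypothetical placeholders) of ordinary Spark ETL and BigDL training living in the same application, on the same cluster, with no export step in between:

```python
import numpy as np
from pyspark.sql import SparkSession
from bigdl.util.common import init_engine, create_spark_conf, Sample

# One SparkSession serves both the ETL stage and the deep learning stage.
spark = SparkSession.builder \
    .config(conf=create_spark_conf().setAppName("etl-plus-dl")) \
    .getOrCreate()
init_engine()

# Ordinary Spark ETL: load, clean, and prepare features with DataFrame ops.
# "events.csv" and its columns are placeholders for your own data.
df = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)
df = df.dropna().withColumn("label", df["label"].cast("double"))

# Hand the same data to BigDL as an RDD of Samples -- no copy out of the
# cluster, no second distributed framework.
feature_cols = [c for c in df.columns if c != "label"]
train_rdd = df.rdd.map(lambda row: Sample.from_ndarray(
    np.array([row[c] for c in feature_cols], dtype="float32"),
    np.array(row["label"] + 1)))  # BigDL labels are 1-based

# ... train_rdd then feeds the same Optimizer pattern shown earlier.
```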