Join us as we continue this series of webinars designed for the community, by the community, with the goal of sharing knowledge, sparking innovation, and further strengthening relationships within our HPCC Systems community.
Episode 11 includes Tech Talks featuring speakers from our community, on topics covering Big Data solutions, Spark integration, and ECL tips leveraging the HPCC Systems platform.
1) Raj Chandrasekaran, CTO & Co-Founder, ClearFunnel - Scaling Data Science capabilities: Leveraging a homogeneous Big Data ecosystem
2) James McMullan, Software Engineer III, LexisNexis Risk Solutions - HDFS Connector Preview
3) Bob Foreman, Senior Software Engineer, LexisNexis Risk Solutions - Building a RELATIONal Dataset - A Valentine’s Day Special!
2. Welcome!
• Please share: Let others know you are here with #HPCCTechTalks
• Ask questions! We will answer as many questions as we can following each speaker.
• Look for polls at the bottom of your screen. Exit full-screen mode or refresh your screen if you don’t see them.
• We welcome your feedback - please rate us before you leave today and visit our blog for information after the event.
• Want to be one of our featured speakers? Let us know! techtalks@hpccsystems.com
The Download: Tech Talks #HPCCTechTalks
3. Community announcements
Dr. Flavio Villanustre
VP Technology
RELX Distinguished Technologist
LexisNexis® Risk Solutions
Flavio.Villanustre@lexisnexisrisk.com
• HPCC Systems Platform updates
• 6.4.10-1 is the latest gold version / Community Changelog
• 6.4.12 RC1 coming soon
• 7.0.0 Beta planned for early Q2 – among the key features:
• Spark integration
• Indexer
• Record Translation
• Session Management Improvements
• VS Code Beta version
• Roadmap items for 2018 and beyond
• Latest Blogs
• HPCC Systems/Tableau Web Data Connector v0.2 Tech Preview
• Machine Learning Demystified
• Reminder: 2018 Summer Internship Proposal Period Open
• Interested candidates can submit proposals from the Ideas List
• Visit the Student Wiki for more details
• Deadline to submit is April 6, 2018
• Program runs late May through mid-August
• Don’t delay!
4. Today’s speakers
Raj Chandrasekaran
CTO & Co-Founder
ClearFunnel
raj@clearfunnel.com
Raj is the CTO and Co-Founder of ClearFunnel, a Big Data analytics-as-a-service platform startup, where he leads product strategy and solutions. ClearFunnel focuses on enabling Marketing Analytics, Advanced Text Analytics, Bioinformatics, and Image Processing for clients in the Technology, Maritime, Publishing, and Healthcare domains.
Featured Community Speaker
5. Today’s speakers
James McMullan
Software Engineer III
LexisNexis Risk Solutions
James.McMullan@lexisnexisrisk.com
James has a broad range of software engineering experience, from developing low-level system drivers for X-ray fluorescence equipment to mobile video games and web applications. He is a recent addition to the LexisNexis team and is part of an internal R&D group, where he has been working on multiple projects including HPCC Systems & Spark benchmarks, integration projects between the HPCC Systems, Spark, and Hadoop ecosystems, and document storage systems.
Bob Foreman
Senior Software Engineer
LexisNexis Risk Solutions
Robert.Foreman@lexisnexisrisk.com
Bob Foreman has worked with the HPCC Systems technology platform and the ECL programming language for over 5 years, and has been a technical trainer for over 25 years. He is the developer and designer of the HPCC Systems Online Training Courses, and is the Senior Instructor for all classroom and Webex/Lync-based training.
6. Scaling Data Science Capabilities:
Leveraging a Homogeneous Big Data Ecosystem
Raj Chandrasekaran
CTO & Co-Founder
ClearFunnel
7. Quick poll:
Where have you had the most success in
deployment of HPCC Systems based solutions?
See poll on bottom of presentation screen
8. To succeed, a Big Data Analytics enterprise needs…
• An efficient Big Data ecosystem, which comprises the following key
capabilities:
• Big Data Processing
• Data Science: ML & AI
• Cloud Integration
• Leveraging these capabilities for Commercial Advantage
• Key Success Factor for any Start-up: Cost of Operations and Cash Flow
9. Big Data Processing
• Top of the list: Hadoop and Spark
• Lots of incremental innovations:
• Hadoop: MapReduce, Hive, HBase, Solr, Pig, Kafka, Yarn, Ambari, Ranger, Knox, Atlas, …
• Spark: Hadoop’s Successor, In-Memory, Directed Acyclic Graph – DAG, Stream Processing,
Machine Learning, SparkSQL, GraphX, Support for Python, Java, R and Scala, …
• Which also means, Lots of Integrations and…
• A variety of Engineering Talent
• Still, all of the above = version 1.01 in the HPCC Systems domain
HPCC Systems Capabilities: Big Data Processing
10. Data Science: ML & AI
• Traditionally, R & Python
• Current State:
• MLlib has a core set of machine learning algorithms, but is certainly not as complete as R or other machine learning libraries such as MADlib
• SparkR is a work in progress… you still need a robust ML library to implement advanced Data Science use cases
• ML is also an evolving field in the HPCC Systems domain.
• ECL-ML modules are fully parallel and cover both supervised and unsupervised models
• Extensibility: ECL is natively designed to manage data, and is therefore easily extensible to implement custom ML algorithms, including neural networks and deep learning.
• ClearFunnel Innovations using ECL-ML:
• Text Processing (Self-learning layered taxonomy, Entity and Topic Extraction, Context Analysis, Point of View Scoring)
• Image Recognition and Pattern Matching (OCR and NN based)
• Maritime Predictive Analytics (Deep Learning with geospatial and IoT streaming data)
HPCC Systems Capabilities: Big Data Processing Data Science
11. Cloud Integration
• AWS: The Big Daddy of Cloud
• Core strengths are really EC2 and S3. All other AWS capabilities and micro-services have been built around these 2 foundational
technologies.
• HPCC Systems on AWS:
• HPCC Systems provides native support for AWS (one-click deployment).
• Additionally, HPCC Systems’ simple, homogeneous tech stack makes it a breeze to operate in the cloud with minimal investment in resources and time.
• ClearFunnel Innovations:
• Spray / De-Spray data between a Thor cluster and S3 at speeds of up to 2 TBPS (Netezza’s data transfer rate is 2 – 4 TB/hr)
• Failsafe job operation (recover instantly from any failures)
• Near Real-Time, Micro-batching, Monitoring, Alert, Data Delivery APIs, etc. capabilities by integrating AWS micro-services
and HPCC Systems
• Key Principles:
• Avoid creating layers of abstractions on both ends (AWS and HPCC Systems).
• Instead integrate HPCC Systems directly with core capabilities of EC2 and S3.
HPCC Systems Capabilities: Big Data Processing Data Science Cloud
12. Leveraging HPCC Systems for Commercial Success
• ClearFunnel has implemented a full-spectrum of complex data engineering use cases
using HPCC Systems:
• Complex and large Graph traversal across nodes
• Image Analytics
• Operational Analytics with Near Real-Time and Stream Processing based Analog data
• Pattern Detection in Bioinformatics
• NLP and advanced Text Analytics
• IoT-based sensor-data integration and analytics
• Advanced Search and Querying
• Single, homogeneous tech stack:
• ClearFunnel’s Big Data Analytics Platform runs these diverse use cases with a homogeneous tech stack,
extending HPCC Systems’ capabilities to meet virtually any Big Data processing requirement
13. Key Success Criteria: Cost of Big Data Operations
• Distinctive Cost benefits from using a Homogeneous tech stack and a highly
productive ECL language
• “Fail-fast, fail-often” and multiple iterations of solution development do not involve a
lot of time, resources, and cost
• Re-use and Refactor core ML & AI modules across use cases (single language
implementation)
• Minimal Cost of Operations:
• ClearFunnel operates multiple production Big Data clusters with hundreds of nodes each, without any dedicated support staff - no Cloud Engineer, Infrastructure Engineer, Network Engineer, Production Support Engineer, DevOps Engineer, or Tech Ops Specialist!
• Enabled by efficient automation and close integration of AWS and HPCC Systems
14. Quick poll:
In your opinion, which of these use cases are most
suitable for implementing in HPCC Systems?
See poll on bottom of presentation screen
17. Quick poll:
Would you be interested in
interacting with the Hadoop
ecosystem from HPCC Systems?
See poll on bottom of presentation screen
18. Overview
• HDFS Connector Motivations
• Why are we making the connector?
• What are our goals for the connector?
• Overview of HDFS Architecture
• How is data stored in HDFS?
• How can we interact with HDFS?
• HDFS Connector Design
• Overview of how the connector works & achieves parallelism
• HDFS Connector Demo
19. HDFS Connector Motivations
• Interact with HDFS datasets and Hadoop processes
• Existing HPCC to Hadoop (h2h) Project
• No longer maintained
• Chance to improve upon h2h
• Tighter integration with HPCC
• Fewer dependencies
• Fewer failure points
• Possibility for New Features
• Variable-length record flat files
• Hadoop File Formats?
20. HDFS Connector Goals
• Robust – Should “Just Work”
• Straightforward
• Few dependencies
• Little to no configuration
• Tightly integrated
• Datasets from HDFS should be first class citizens
• Performant
• Parallelism where possible
• Reduce data transfer costs
21. Overview of HDFS Architecture
• How are files stored in HDFS?
• Stored as blocks of data & metadata
• Blocks are usually 64 MiB
• Blocks replicated for fault tolerance
• Namenode
• File metadata
• Filesystem namespace
• Datanodes
• Blocks of data
• No knowledge of files
[Diagram: a Namenode holding file metadata, with multiple Datanodes storing data blocks]
22. Overview of HDFS Architecture
• Reading & Writing in HDFS
• Namenode arbitrates reads & writes
• Datanodes fulfill reads & writes
• Multiple readers / Single writer
• Client Applications
• Java Hadoop or native libHDFS libraries
• Messaging uses Google Protocol Buffers
[Diagram: multiple client applications communicating with the Namenode and the Datanodes]
23. HDFS Connector Design – Communicating with HDFS
• Java Hadoop or native libHDFS library?
• libHDFS relies on the Java Hadoop libraries
• Both require Hadoop to be installed locally
• Google Protocol Buffers?
• Possible but a lot of work
• libHDFS3
• Part of Apache HAWQ
• Completely native implementation of libHDFS
24. HDFS Connector Design – HPCC Integration
• ECL PIPE?
• High data transfer costs
• Loosely coupled
• Leverages native ECL
• Import Java Library?
• High data transfer costs
• Adds lots of dependencies
• Parallelism is difficult
• Native ECL Plugin?
• Low data transfer costs
• Fewest dependencies
• Parallelism is possible
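Of the integration options above, ECL PIPE is the easiest to sketch. This is a hypothetical example - the record layout, the HDFS path, and the assumption that the Hadoop CLI is available on every Thor node are all illustrative, not part of the connector itself:

```ecl
// Hypothetical layout of a CSV file stored in HDFS
CsvRec := RECORD
    STRING20 name;
    STRING10 state;
END;

// PIPE as a dataset source: run an external command and parse its
// stdout as CSV. Simple and loosely coupled, but the data is streamed
// through an external process - the "high data transfer costs" noted above.
HdfsData := PIPE('hdfs dfs -cat /data/people.csv', CsvRec, CSV);

OUTPUT(HdfsData);
```

This illustrates why PIPE is listed as loosely coupled: HPCC has no knowledge of HDFS block locations, so it cannot align reads with the data, which motivates the native plugin approach.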
25. HDFS Connector Design – Reading Data in Parallel
• CSV Files & Fixed-Record Flat Files
• Break HDFS file into logical chunks
• One chunk per HPCC node
• Chunks aren’t record aligned
• Consume records that begin in our chunk
• Variable-Length Record Flat Files
• Need record split metadata
• Create split metadata on write
• Preprocess step if no metadata
[Diagram: an HDFS file split into chunks, one per HPCC node; each node consumes the records beginning in its own chunk, with split metadata marking record boundaries for variable-length records]
26. HDFS Connector Design – Writing to HDFS
• HDFS is single writer
• Single File
• Each Thor node writes its data to the file in sequence
• Requires Append mode to be enabled
• Interacts well with existing HDFS ecosystem
• Multiple File Parts
• Similar to how HPCC stores files
• Parallel writing
• Existing Hadoop applications would need to be updated
27. HDFS Connector Demo – Writing a dataset to HDFS
28. HDFS Connector Demo – Reading a dataset from HDFS
29. HDFS Connector Demo – Working with HDFS Datasets
30. Quick poll:
Do you currently use HDFS as a data
store?
See poll on bottom of presentation screen
32. ECL Tips and Cool Tricks –
Building a Relational Dataset
Bob Foreman
Senior Software Engineer
LexisNexis Risk Solutions
33. Quick poll:
Have you ever worked with a relational
denormalized dataset in ECL?
See poll on bottom of presentation screen
34. Background
• Most of our datasets on an HPCC cluster are organized in a normalized
architecture.
• A unique linking field in one dataset can be used to join with other datasets
using a one-to-one or a one-to-many relationship.
• At LexisNexis, we affectionately refer to this architecture as the “Data Donut”.
35. The LN Data “Donut”
[Diagram: the LN Data “Donut” - datasets linked by fields such as DID, ADL, IDL, LinkID, and LexID]
Sometimes, analyzing or querying this normalized data can be challenging. Enter the “denormalized” dataset!
36. Given a sample 3-level hierarchical relational database:
People
├─ Vehicle
└─ Property
   └─ Taxdata
Example Data: [shown on slide]
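In ECL, a hierarchy like this is typically declared with nested child datasets. A minimal sketch - the field names below are illustrative, not the actual demo layouts:

```ecl
TaxRec := RECORD
    UNSIGNED4 assess_year;
    UNSIGNED8 tax_amount;
END;

PropertyRec := RECORD
    STRING40 address;
    DATASET(TaxRec) Taxdata;          // third level: tax records per property
END;

VehicleRec := RECORD
    STRING20 make;
    STRING20 model;
END;

PeopleRec := RECORD
    UNSIGNED8 personid;
    STRING30  name;
    DATASET(VehicleRec)  Vehicles;    // second level
    DATASET(PropertyRec) Properties;  // second level, nesting Taxdata
END;
```

One parent record carries its entire family of child records, which is exactly the shape the denormalization on the next slide produces.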
37. Denormalizing Related Data:
[Diagram: one denormalized record - the People record followed by its Vehicle records, then each Property record followed by its Taxdata records, running from “Start One Record” to “End of record” across continued rows]
39. DENORMALIZE(parentoutput, childrecset, condition, transform)
parentoutput – The set of parent records already formatted as the result of the
combination.
childrecset – The set of child records to process.
condition – An expression that specifies how to match records between the
parent and child records.
transform – The TRANSFORM function to call.
The DENORMALIZE function forms flat file records from a parent and any number
of children.
The transform function must take at least two parameters: a LEFT record of the same format as the resulting combined parent and child records, and a RIGHT record of the same format as the childrecset. An optional integer COUNTER parameter can be included, indicating the current iteration through the child records.
DENORMALIZE Function:
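Putting the syntax above together, here is a minimal sketch following the common pattern from the ECL Language Reference. The dataset contents and field names are illustrative; the result nests each parent's child records inside a child DATASET field:

```ecl
ParentRec := RECORD
    UNSIGNED1 NameID;
    STRING20  Name;
END;

ChildRec := RECORD
    UNSIGNED1 NameID;
    STRING20  Addr;
END;

ParentDS := DATASET([{1, 'Kevin'}, {2, 'Liz'}], ParentRec);
ChildDS  := DATASET([{1, '10 Main St'}, {1, '20 Oak Ave'},
                     {2, '5 Elm St'}], ChildRec);

// Combined layout: parent fields plus a nested child dataset
CombinedRec := RECORD
    ParentRec;
    UNSIGNED1 NumRows;
    DATASET(ChildRec) Children;
END;

// Seed the combined layout from the parent records
CombinedRec InitParent(ParentRec L) := TRANSFORM
    SELF.NumRows  := 0;
    SELF.Children := [];
    SELF          := L;
END;

// Append one child per call; COUNTER is the iteration number
CombinedRec AddChild(CombinedRec L, ChildRec R, INTEGER C) := TRANSFORM
    SELF.NumRows  := C;
    SELF.Children := L.Children + R;
    SELF          := L;
END;

DeNormed := DENORMALIZE(PROJECT(ParentDS, InitParent(LEFT)), ChildDS,
                        LEFT.NameID = RIGHT.NameID,
                        AddChild(LEFT, RIGHT, COUNTER));
OUTPUT(DeNormed);
```

Each of Kevin's two addresses is folded into his single denormalized record, with NumRows recording how many children were absorbed.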
40. Querying Relational Data: Implicit Dataset Relationality (nested child datasets)
• Parent record fields are always in memory when operating at the level of the Child.
• You may only reference the related set of Child records when operating at the level of the Parent.
[Diagram: the People → Vehicle / Property → Taxdata hierarchy]
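Both rules can be illustrated with a small nested dataset (names and layouts are illustrative, echoing the People/Vehicle/Property/Taxdata hierarchy):

```ecl
VehicleRec := RECORD
    STRING20 make;
END;
PropertyRec := RECORD
    STRING40 address;
END;
PeopleRec := RECORD
    STRING30 name;
    DATASET(VehicleRec)  Vehicles;
    DATASET(PropertyRec) Properties;
END;

People := DATASET([{'Kevin', [{'Ford'}, {'Mazda'}], [{'10 Main St'}]},
                   {'Liz',   [{'Jeep'}],            []}], PeopleRec);

// Parent level: the child dataset is referenced as a whole,
// e.g. an aggregate filter over each person's Vehicles
MultiCarOwners := People(COUNT(Vehicles) > 1);

// Child level: parent fields stay in scope while children are processed,
// e.g. flattening Properties while carrying the parent's name along
OutRec := RECORD
    STRING30 name;
    STRING40 address;
END;
Flat := NORMALIZE(People, LEFT.Properties,
                  TRANSFORM(OutRec, SELF.name    := LEFT.name,
                                    SELF.address := RIGHT.address));

OUTPUT(MultiCarOwners);
OUTPUT(Flat);
```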
41. NORMALIZE(recordset, expression, transform)
recordset – The set of records to process.
expression – A numeric expression specifying the total number of times to call the transform for that record.
transform – The TRANSFORM function to call for each record in the recordset.
The NORMALIZE function iterates through all the records in the recordset, performing the transform function the expression number of times on each record in turn, to produce relational child records of the parent.
The transform function must take two parameters: A LEFT record of the same
format as the recordset, and an integer COUNTER specifying the number of times
to call the transform for that record. The format of the resulting recordset can be
different from the input.
NORMALIZE Function
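A minimal sketch of the counter form described above (names are illustrative), splitting two address fields on each input record back out into individual child records - the inverse direction of DENORMALIZE:

```ecl
InRec := RECORD
    UNSIGNED1 NameID;
    STRING20  Addr1;
    STRING20  Addr2;
END;

InDS := DATASET([{1, '10 Main St', '20 Oak Ave'},
                 {2, '5 Elm St',   ''}], InRec);

OutRec := RECORD
    UNSIGNED1 NameID;
    STRING20  Addr;
END;

// The transform is called twice per input record; COUNTER picks the field
OutRec SplitAddrs(InRec L, INTEGER C) := TRANSFORM
    SELF.NameID := L.NameID;
    SELF.Addr   := CHOOSE(C, L.Addr1, L.Addr2);
END;

ChildRecs := NORMALIZE(InDS, 2, SplitAddrs(LEFT, COUNTER));
OUTPUT(ChildRecs);
```

A filter such as ChildRecs(Addr <> '') would then drop the empty slots produced by records with fewer addresses.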
43. Summary
• Using a denormalized dataset can improve the power of your queries and reveal hidden relationships in the data.
• ECL provides powerful, easy-to-use support for moving from a normalized to a denormalized format when needed.
• Knowing how to move in both directions, and the best practices for doing so, is a valuable skill for every ECL developer.
44. In closing: LOVE YOUR DATA!
45. Quick poll:
After today’s ECL Tech Tip, will you use
DENORMALIZE for any advanced query
applications?
See poll on bottom of presentation screen
47. • Have a new success story to share?
• Want to pitch a new use case?
• Have a new HPCC Systems application you want to demo?
• Want to share some helpful ECL tips and sample code?
• Have a new suggestion for the roadmap?
• Be a featured speaker for an upcoming episode! Email your idea to
Techtalks@hpccsystems.com
• Visit The Download Tech Talks wiki for more information:
https://wiki.hpccsystems.com/display/hpcc/HPCC+Systems+Tech+Talks
Mark your calendar for the March 15 Tech Talk -
More machine learning topics coming!
Watch our Events page for details.
Submit a talk for an upcoming episode!
48. A copy of this presentation will be made available soon on our blog:
hpccsystems.com/blog
Thank You!