Department of Home Affairs
Hadoop: The Unintended Benefits
Steven O’Neill, Director EDW Platforms
Dwane Hall, Hadoop Developer
February 2019
Agenda
• The Department of Home Affairs
• Hadoop at the Department
• Unintended Benefits?
– Solr and EDW Offload
– Graph and Data Science
Hadoop: The Unintended Benefits [UNCLASSIFIED] 2
The Department of Home Affairs
The Department of Home Affairs
Overview 2017-18
Passenger Systems, Teradata and Hadoop
Early Learnings
• Keep it simple
• Run your own race
• Use your existing standards
• Vendor engagement
• Expect rework
Next Steps
• Initial use-cases are working
• Once you have done the hard work to get going, the hard work really starts:
– Use cases become more complex
– Number of tools increases
– Expectations build
• Expectations. There’s a conflict between:
– The possible
– The in-progress
– The in-production
… and the unintended benefits of using an open, multi-purpose platform
Solr: Search Requirements
Search a lot of free text rows quickly
• Billions of rows
• The business requires all of history (i.e. no date filters or partitions)
• ‘Like’/wildcard searching required
Current solution / problems
• Build integrated data on Teradata EDW
• Perform full table searches using a %like% clause
• Searches can be slow on large tables (some tables exceed 4 billion rows; most exceed 100 million rows)
Examples:
• Select * from Contact where Name like '%Jonesy@skiclub.com%';
• Select * from Imports where Goods like '%snowboard%';
Solr – A Quick Overview
RDBMS:
ID | Description
123 | Garmin Forerunner
456 | Snowboard bindings
789 | Hadoop for Dummies
888 | Snowboard
999 | Garmin
Select * from Imports where Description like '%snowboard%';
- Full Table Scan (ALL AMPS)
- Like wildcard search (can’t use secondary index)
- If the table is large (billions of rows), the process can be slow if no date partition (PPI) is specified
SOLR:
Description | ID
Garmin | 123, 999
Forerunner | 123, 143
Snowboard | 456, 888
Bindings | 789, 1001
Hadoop | 789, 5421, 652
http://hadoop:3000/solr/imports/select?q=description:snowboard
- Inverted index lookup (fast)
- Ideally you will end up with fewer terms than documents
- Additional functionality (soundex / phonetic / faceting / autocomplete) comes with SOLR / Lucene
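The contrast above between a wildcard scan and an inverted-index lookup can be sketched in a few lines of Python. This is a toy illustration only, not how Solr or Lucene is implemented: the tokeniser (lowercase, split on whitespace) and the function names `like_scan`, `build_inverted_index` and `lookup` are assumptions made for the example. The rows are the five sample rows from the slide.

```python
from collections import defaultdict

# Sample rows from the slide: ID -> Description
ROWS = {
    123: "Garmin Forerunner",
    456: "Snowboard bindings",
    789: "Hadoop for Dummies",
    888: "Snowboard",
    999: "Garmin",
}


def like_scan(rows, needle):
    """Emulate LIKE '%needle%': every row is inspected (full scan)."""
    needle = needle.lower()
    return sorted(doc_id for doc_id, text in rows.items() if needle in text.lower())


def build_inverted_index(rows):
    """Map each lowercased term to the set of IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in rows.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index, term):
    """Single dictionary lookup; no row is scanned at query time."""
    return sorted(index.get(term.lower(), set()))


index = build_inverted_index(ROWS)
print(like_scan(ROWS, "snowboard"))   # inspects all five rows
print(lookup(index, "snowboard"))     # one lookup in the term dictionary
```

Both calls print `[456, 888]`; the difference is that the scan cost grows with the number of rows, while the lookup cost depends on the size of the term dictionary.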
Solr with Hadoop
Solr + Data Service
[Diagram labels: Web App, Data Services, TERADATA, INFINIBAND / BYNET, ACQUIRE, INTEGRATE, DELTA]
1. SOLR search (return ID)
2. Teradata lookup (return detail)
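The two-step flow on this slide (the search returns an ID, then the warehouse returns the detail) might look roughly like the sketch below. Everything here is a stand-in chosen so the example runs anywhere: `sqlite3` plays the role of the Teradata detail table, a plain dictionary plays the role of the Solr collection, and the names `search_ids`, `fetch_details` and `data_service_search` are hypothetical, not the Department's data service API.

```python
import sqlite3

# Stand-in for the Teradata detail table (sqlite3 is used only so the sketch runs).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE imports (id INTEGER PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO imports (id, description) VALUES (?, ?)",
    [(456, "Snowboard bindings"), (888, "Snowboard"), (999, "Garmin")],
)

# Stand-in for the Solr collection: term -> matching IDs.
SEARCH_INDEX = {"snowboard": [456, 888], "bindings": [456], "garmin": [999]}


def search_ids(term):
    """Step 1: the search tier returns only IDs."""
    return SEARCH_INDEX.get(term.lower(), [])


def fetch_details(connection, ids):
    """Step 2: the warehouse returns the detail rows by primary key."""
    if not ids:
        return []
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT id, description FROM imports "
        f"WHERE id IN ({placeholders}) ORDER BY id"
    )
    return connection.execute(sql, list(ids)).fetchall()


def data_service_search(connection, term):
    """Chain the two steps the way a data service might."""
    return fetch_details(connection, search_ids(term))


print(data_service_search(conn, "snowboard"))
```

The point of the split is that the free-text match never touches the large table; the relational lookup is a keyed fetch on a short list of IDs.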
Security - Relational to Unstructured
• …Solr has no LDAP authentication out of the box
• Dynamic and not temporal index
How do we do it?
• Multi-layered
• Kerberos and Ranger for indexes in HDFS (not the Kerberos Authentication Plugin)
• SSL
• Solr Basic Authentication and Rule-Based Authorisation Plugin
Security Detail
• Collection based – document level in the future
• All result payloads are logged and stored with a UUID
• Data Services for user and audit logging - Traceability
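One way to picture the point about logging every result payload under a UUID is the minimal sketch below. It is an assumption-laden illustration, not the Department's data service: `AUDIT_STORE`, `log_result`, the field names and the user name are all hypothetical, and a real service would persist the records rather than hold them in a dictionary.

```python
import json
import uuid
from datetime import datetime, timezone

AUDIT_STORE = {}  # hypothetical in-memory store, for illustration only


def log_result(user, query, payload, store=AUDIT_STORE):
    """Store the result payload under a fresh UUID and return that UUID."""
    audit_id = str(uuid.uuid4())
    store[audit_id] = {
        "user": user,
        "query": query,
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "payload": json.dumps(payload, sort_keys=True),
    }
    return audit_id


audit_id = log_result("analyst1", "description:snowboard", {"ids": [456, 888]})
print(audit_id, AUDIT_STORE[audit_id]["query"])
```

Returning the UUID to the caller means a later review can trace exactly which payload a given user received for a given query.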
Lessons Learned + Next Steps
• The ecosystem is extensive and can be overwhelming. Start simple.
• DEV environment representative of HLEs (Multi-Node, Security)
• Plan for entire index rebuilds
• Use collection aliases and the Request Parameters API
• Manage expectations - Solr is not a database
• Index definition affects performance. Test at scale (JMeter, SolrMeter)
• Rework Python script (HDF Kafka + NiFi)
• DevOps for deployment
• Autoscaling, document level security
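The advice to plan for full index rebuilds and to use collection aliases fits together: clients query an alias, and after a rebuild the alias is repointed at the new collection. The sketch below only builds the request URLs and sends nothing. It assumes the host and port from the slide's example URL and a hypothetical rebuilt collection named `imports_v2`; `CREATEALIAS` with `name` and `collections` is the Solr Collections API action for defining or redefining an alias.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://hadoop:3000/solr"  # host and port as in the slide's example URL


def create_alias_url(alias, collection, base=SOLR_BASE):
    """Build a Collections API CREATEALIAS request pointing alias at collection."""
    params = {"action": "CREATEALIAS", "name": alias, "collections": collection}
    return f"{base}/admin/collections?{urlencode(params)}"


def select_url(alias, query, base=SOLR_BASE):
    """Build a search request; clients only ever see the alias name."""
    params = {"q": query}
    return f"{base}/{alias}/select?{urlencode(params)}"


# After rebuilding into a new collection (hypothetical name), repoint the alias.
print(create_alias_url("imports", "imports_v2"))
print(select_url("imports", "description:snowboard"))
```

Because the search URL never names the underlying collection, a rebuild does not require any change on the client side.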
JanusGraph
Use Cases
• Identifying networks of persons of interest
• Predictive analytics using demographics
• Automation of threat discovery
• Process complex relationships
• Identify soft links between entities
• Interactive tooling for analysts
• For use by:
– Data scientists
– Analysts
JanusGraph
Why JanusGraph?
• Scalable: Optimised for storing billions of vertices and edges
• Works with the Hadoop ecosystem
• Can perform fuzzy searching and graph traversal
• Fast: Sub-second query response times
• Supports unstructured, semi-structured and structured data
• Supported by a variety of visualisation tools
• Backed by Hortonworks and Google
• Fits our “Open Source First” principle
JanusGraph
Architecture
JanusGraph
Load Techniques and Performance
Method | Results | Records per Sec
Single-threaded batch process load into HBase via JanusGraph | Poor | Low '000s
Parallelised load into HBase via JanusGraph | Good | 10k+ per mapper
Bulk Spark OLAP load into HBase via JanusGraph | Very Good | 10k+
Parallelised load into Solr through the Python Teradata Script Operator | Good | 10k+
NiFi streaming load process into HBase/Solr | Excellent | 10k+ constant stream
To keep the graph clean, the existence of vertices and edges was checked when loading.
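The existence check mentioned above amounts to a get-or-create step for every vertex and edge. A minimal sketch of that idea follows, using an in-memory toy graph so it can run anywhere. `InMemoryGraph`, its methods, the keys and the `knows` label are hypothetical and are not the JanusGraph API; a real loader would issue the equivalent check against the graph store.

```python
class InMemoryGraph:
    """Toy stand-in for a graph store; not the JanusGraph API."""

    def __init__(self):
        self.vertices = {}   # key -> properties
        self.edges = set()   # (source key, label, target key)

    def get_or_create_vertex(self, key, **properties):
        """Add the vertex only if its key is not already present."""
        if key not in self.vertices:
            self.vertices[key] = dict(properties)
            return True
        return False

    def get_or_create_edge(self, source, label, target):
        """Add the edge only if the same (source, label, target) is absent."""
        edge = (source, label, target)
        if edge in self.edges:
            return False
        self.edges.add(edge)
        return True


def load(graph, records):
    """Load (source, label, target) records; return how many items were new."""
    created = 0
    for source, label, target in records:
        created += graph.get_or_create_vertex(source)
        created += graph.get_or_create_vertex(target)
        created += graph.get_or_create_edge(source, label, target)
    return created


graph = InMemoryGraph()
records = [("p1", "knows", "p2"), ("p1", "knows", "p2"), ("p2", "knows", "p3")]
print(load(graph, records), len(graph.vertices), len(graph.edges))
```

The duplicate second record creates nothing, so the run prints `5 3 2`: five new items, three vertices, two edges. Re-running the same load would create nothing further, which is what makes repeated or overlapping loads safe.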
JanusGraph
Security Hardening
• Authentication
– Authentication of Hadoop services is secured using Kerberos.
• Authorisation
– Authorisation of data and services is governed through:
– Access Control Lists
– Ranger policies
• Encryption
– TLS encryption is enabled to enforce secure communication channels.
– Encryption of data at rest is enforced at the appliance level.
• Firewall
– iptables rules are applied across the cluster.
JanusGraph
Visualisation
• Visualisation tools:
– R/Shiny for data science
– Linkurious for analysts
• Graph analytics:
– Timeline
– Path
– Connectivity
– Community
– Centrality
– PageRank
[Figure labels: Solr search; JanusGraph traversal]
JanusGraph
Lessons learned
• Collaboration between Data Engineering and Data Science is invaluable
• Start with a generic design
• Load data using simple batch methods before using advanced streaming methods
• Open source incubator projects can be buggy
• Issues were encountered when using a JanusGraph mixed index against a secured Solr environment
• JanusGraph is still not part of the core Hadoop distribution and stack
• Start in a discovery environment before proceeding to a 24/7 supported production environment
Questions?

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
fredae14
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 

Último (20)

Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 

Hadoop: The Unintended Benefits

  • 1. Department of Home Affairs. Hadoop: The Unintended Benefits. Steven O’Neill, Director EDW Platforms; Dwane Hall, Hadoop Developer. February 2019
  • 2. Agenda (Department of Home Affairs; Hadoop: The Unintended Benefits [UNCLASSIFIED])
      • The Department of Home Affairs
      • Hadoop at the Department
      • Unintended Benefits?
          – Solr and EDW Offload
          – Graph and Data Science
  • 3. The Department of Home Affairs
  • 4. The Department of Home Affairs: Overview 2017-18
  • 5. (no text on slide)
  • 6. Passenger Systems, Teradata and Hadoop
  • 7. Early Learnings
      • Keep it simple
      • Run your own race
      • Use your existing standards
      • Vendor engagement
      • Expect rework
  • 8. Next Steps
      • Initial use-cases are working
      • Once you have done the hard work to get going, the hard work really starts:
          – Use cases become more complex
          – Number of tools increases
          – Expectations build
      • Expectations. There’s a conflict between:
          – The possible
          – The in-progress
          – The in-Production
      … and the unintended benefits of using an open, multi-purpose platform
  • 9. Solr: Search Requirements
      Search a lot of free text rows quickly
      • Billions of rows
      • Business requires all of history (i.e. no date filters or partitions)
      • ‘Like’/wildcard searching required
      Current solution / problems
      • Build integrated data on Teradata EDW
      • Perform full table searches using a LIKE '%...%' clause
      • Can be slow when searching (i.e. >4 billion row tables, most tables >100 million rows)
      Examples:
      • Select * from Contact where Name like '%Jonesy@skiclub.com%';
      • Select * from Imports where Goods like '%snowboard%';
  • 10. Solr – A Quick Overview
      RDBMS:
          ID    Description
          123   Garmin Forerunner
          456   Snowboard bindings
          789   Hadoop for Dummies
          888   Snowboard
          999   Garmin
          Select * from Imports where Description like '%snowboard%';
          - Full Table Scan (ALL AMPS)
          - Like wildcard search (can’t use secondary index)
          - If the table is large (billions of rows), the process can be slow if no date partition (PPI) is specified
      SOLR:
          Description   ID
          Garmin        123, 999
          Forerunner    123, 143
          Snowboard     456, 888
          Bindings      789, 1001
          Hadoop        789, 5421, 652
          http://hadoop:3000/solr/imports/select?q=description:snowboard
          - Inverted index lookup (fast)
          - Ideally you will end up with fewer terms than documents
          - Additional functionality (soundex / phonetic / faceting / autocomplete) comes with SOLR / Lucene
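The contrast on slide 10 between a wildcard scan and an inverted index lookup can be sketched in a few lines of Python. This is only an illustration of the idea, not how Solr or Teradata work internally: the index is built from the five sample rows on the slide alone (so it will not contain the extra IDs shown in the slide's index table), and the function names `like_scan`, `build_inverted_index` and `term_lookup` are made up for this sketch.

```python
from collections import defaultdict

# The five sample rows from the slide's relational table.
ROWS = {
    123: "Garmin Forerunner",
    456: "Snowboard bindings",
    789: "Hadoop for Dummies",
    888: "Snowboard",
    999: "Garmin",
}


def like_scan(rows, needle):
    """Mimic LIKE '%needle%': every row has to be inspected."""
    needle = needle.lower()
    return sorted(i for i, text in rows.items() if needle in text.lower())


def build_inverted_index(rows):
    """Map each lower-cased term to the set of row IDs that contain it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for term in text.lower().split():
            index[term].add(row_id)
    return index


def term_lookup(index, term):
    """A single dictionary lookup; no rows are scanned at query time."""
    return sorted(index.get(term.lower(), ()))


INDEX = build_inverted_index(ROWS)
print(like_scan(ROWS, "snowboard"))     # scan over all rows
print(term_lookup(INDEX, "snowboard"))  # direct lookup by term
```

Both calls return the same IDs here; the difference is that the scan cost grows with the number of rows, while the lookup cost depends on the number of distinct terms.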
  • 11. Solr with Hadoop
  • 12. Solr + Data Service
      Diagram labels: Web App, Data Services, Teradata, InfiniBand / BYNET, Acquire, Delta, Integrate
      Query flow:
          1. SOLR search (return ID)
          2. Teradata lookup (return detail)
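The two-step flow on slide 12 (search returns IDs, the warehouse returns the detail) might look roughly like the sketch below. Everything here is a stand-in chosen so the example runs on its own: a plain dictionary plays the role of the Solr index, an in-memory SQLite database plays the role of Teradata, and the `imports` table, its `quantity` column and all values are hypothetical.

```python
import sqlite3

# Stand-in for the search index: term -> matching row IDs (hypothetical data).
SEARCH_INDEX = {"snowboard": {456, 888}, "garmin": {123, 999}}


def make_warehouse():
    """Create an in-memory table standing in for the warehouse detail table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE imports (id INTEGER PRIMARY KEY, description TEXT, quantity INTEGER)"
    )
    conn.executemany(
        "INSERT INTO imports VALUES (?, ?, ?)",
        [(123, "Garmin Forerunner", 10), (456, "Snowboard bindings", 4),
         (888, "Snowboard", 2), (999, "Garmin", 7)],
    )
    return conn


def search_ids(index, term):
    """Step 1: the search layer returns only the matching IDs."""
    return sorted(index.get(term.lower(), ()))


def lookup_detail(conn, ids):
    """Step 2: fetch full rows by primary key instead of scanning free text."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    sql = (
        "SELECT id, description, quantity FROM imports "
        f"WHERE id IN ({placeholders}) ORDER BY id"
    )
    return conn.execute(sql, ids).fetchall()


def search_with_detail(index, conn, term):
    return lookup_detail(conn, search_ids(index, term))


WAREHOUSE = make_warehouse()
print(search_with_detail(SEARCH_INDEX, WAREHOUSE, "snowboard"))
```

The point of the split is that the relational side only ever does a keyed lookup, which is the kind of access it handles well.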
  • 13. Security - Relational to Unstructured
      • …Solr has no LDAP authentication out of the box
      • Dynamic and not temporal index
      How we do it?
      • Multi-layered
      • Kerberos and Ranger for indexes in HDFS (Not Kerberos Authentication Plugin)
      • SSL
      • Solr Basic Authentication and Rule-Based Authorisation Plugin
  • 14. Security Detail
      • Collection based – document level in the future
      • All result payloads are logged and stored with a UUID
      • Data Services for user and audit logging - Traceability
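One way to read "all result payloads are logged and stored with a UUID" on slide 14 is sketched below. The slides do not describe the actual mechanism, so the record layout, the `log_result` function and the in-memory dictionary used as the audit store are all assumptions made for illustration.

```python
import json
import uuid
from datetime import datetime, timezone


def log_result(audit_store, user, query, payload):
    """Store the returned payload under a fresh UUID and return that UUID."""
    entry_id = str(uuid.uuid4())
    audit_store[entry_id] = {
        "user": user,
        "query": query,
        "logged_at": datetime.now(timezone.utc).isoformat(),
        # Serialise the payload so the stored copy cannot change later.
        "payload": json.dumps(payload, sort_keys=True),
    }
    return entry_id


AUDIT_STORE = {}
ENTRY_ID = log_result(
    AUDIT_STORE,
    user="analyst01",                # hypothetical user name
    query="description:snowboard",
    payload=[{"id": 456}, {"id": 888}],
)
print(ENTRY_ID, AUDIT_STORE[ENTRY_ID]["user"])
```

Returning the UUID to the caller means a later question about what a user saw can be answered by looking up a single record.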
  • 15. Lessons Learned + Next Steps
      • Extensive ecosystem which can be overwhelming. Start simple.
      • DEV environment representative of HLE’s (Multi-Node, Security)
      • Plan for entire index rebuilds
      • Use collection aliases, request params API
      • Manage expectations - Solr is not a database
      • Index definition affects performance. Test at scale (JMeter, SolrMeter)
      • Rework Python script (HDF Kafka + NiFi)
      • DevOps for deployment
      • Autoscaling, document level security
  • 16. JanusGraph Use Cases
      • Identifying networks of persons of interest
      • Predictive analytics using demographics
      • Automation of threat discovery
      • Process complex relationships
      • Identify soft links between entities
      • Interactive tooling for analysts
      • For use by:
          – Data scientists
          – Analysts
  • 17. Why JanusGraph?
      • Scalable: Optimised for storing billions of vertices and edges
      • Works with the Hadoop ecosystem
      • Can perform fuzzy searching and graph traversal
      • Fast: Sub-second query response times
      • Supports unstructured, semi-structured and structured data
      • Supported by a variety of visualisation tools
      • Backed by Hortonworks and Google
      • Fits with our “Open Source First” principle
  • 18. JanusGraph Architecture
  • 19. JanusGraph Load Techniques and Performance
      Method | Results | Records per Sec
      Single-threaded batch process load into HBase via JanusGraph | Poor | Low ‘000s
      Parallelised load into HBase via JanusGraph | Good | 10k+ per mapper
      Bulk Spark OLAP load into HBase via JanusGraph | Very Good | 10k+
      Parallelised load into Solr through the Python Teradata Script Operator | Good | 10k+
      NiFi streaming load process into HBase/Solr | Excellent | 10k+ constant stream
      To keep the graph clean, the existence of vertices and edges was checked when loading
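The note on slide 19 about checking for existing vertices and edges during loading describes a get-or-create pattern. The sketch below shows that pattern on a small in-memory structure; it does not use the JanusGraph or Gremlin APIs, and the class, the `person` / `phone` labels and the sample records are hypothetical.

```python
class InMemoryGraph:
    """Tiny stand-in for a graph store, used only to show get-or-create loading."""

    def __init__(self):
        self.vertices = {}  # (label, key) -> properties
        self.edges = set()  # (source vertex, edge label, target vertex)

    def get_or_create_vertex(self, label, key, **properties):
        vertex_id = (label, key)
        # Only create the vertex if it is not already present.
        if vertex_id not in self.vertices:
            self.vertices[vertex_id] = dict(properties)
        return vertex_id

    def get_or_create_edge(self, source, edge_label, target):
        edge = (source, edge_label, target)
        created = edge not in self.edges
        self.edges.add(edge)
        return created


def load(graph, records):
    """Load (person, phone) pairs without creating duplicates."""
    for person, phone in records:
        person_v = graph.get_or_create_vertex("person", person)
        phone_v = graph.get_or_create_vertex("phone", phone)
        graph.get_or_create_edge(person_v, "uses", phone_v)


RECORDS = [("person-A", "phone-1"), ("person-B", "phone-1")]
GRAPH = InMemoryGraph()
load(GRAPH, RECORDS)
load(GRAPH, RECORDS)  # replaying the same batch must not add duplicates
print(len(GRAPH.vertices), len(GRAPH.edges))
```

The existence check makes the load idempotent, at the cost of one extra lookup per vertex and edge, which is one plausible reason load throughput varies so much between the techniques in the table.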
  • 20. JanusGraph Security Hardening
      • Authentication
          – Authentication of Hadoop services is secured using Kerberos.
      • Authorisation
          – Authorisation of data and services is governed through:
              – Access Control Lists
              – Ranger policies
      • Encryption
          – TLS encryption is enabled to enforce secure communication channels.
          – Encryption of data at rest is enforced at the appliance level.
      • Firewall
          – IP table rules applied across the cluster.
  • 21. JanusGraph Visualisation
      • Visualisation tools:
          – R/Shiny for data science
          – Linkurious for analysts
      • Graph analytics:
          – Timeline
          – Path
          – Connectivity
          – Community
          – Centrality
          – PageRank
      Image labels: Solr search, JanusGraph traversal
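Of the graph analytics listed on slide 21, PageRank is the easiest to show in isolation. The sketch below is a plain power-iteration version in pure Python on a four-node toy graph; it is not the implementation used with JanusGraph, and the damping factor, iteration count and edge list are assumptions chosen for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the same baseline share.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = out_links[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: a, b and d all point at c; c points back at a.
EDGES = [("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
RANKS = pagerank(EDGES)
for node in sorted(RANKS, key=RANKS.get, reverse=True):
    print(node, round(RANKS[node], 3))
```

In a network of entities, a high score marks a node that many others link to, directly or through other well-linked nodes, which is why it sits alongside centrality in the list.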
  • 22. JanusGraph Lessons Learned
      • Collaboration between Data Engineering and Data Science is invaluable
      • Start with a generic design
      • Load data using simple batch methods before using advanced streaming methods
      • Open source incubator projects can be buggy
      • Issues were encountered when using a JanusGraph mixed index against a secured Solr environment
      • JanusGraph is still not part of the core Hadoop distribution and stack
      • Start in a discovery environment, before proceeding to a 24/7 supported production environment
  • 23. Questions?