Page 1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDP Search Workshop
Hortonworks. We do Hadoop.
1/29/2013
Agenda
•  Hortonworks Data Platform 2.2
•  Apache Solr
•  Query & Ingest Documents with Apache Solr
•  Solr & Hadoop
•  Index on HDFS
•  MapReduce, Hive & Pig
•  Solr Cloud
•  Sizing
•  Demo
HDP 2.2
HDP delivers a comprehensive data management platform
Hortonworks Data Platform 2.2
[Diagram: HDP 2.2 architecture]
•  Governance: Falcon (data workflow, lifecycle & governance), Sqoop, Flume, Kafka, NFS, WebHDFS
•  Batch, interactive & real-time data access: Pig (script), Hive (SQL) and Cascading (Java, Scala) on Tez; Storm (stream); Solr (search); HBase and Accumulo (NoSQL); Slider; Spark (in-memory); other ISV engines
•  YARN: Data Operating System (cluster resource management), on top of HDFS (Hadoop Distributed File System)
•  Security: authentication, authorization, accounting, data protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox, Ranger)
•  Operations: Ambari (provision, manage & monitor), ZooKeeper, Oozie (scheduling)
•  Deployment choice: Linux, Windows, on-premises, cloud
YARN is the architectural center of HDP
•  Enables batch, interactive and real-time workloads
•  Provides comprehensive enterprise capabilities
•  The widest range of deployment options
•  Delivered completely in the open
HDP 2.2: Reliable, Consistent & Current
HDP is Apache Hadoop, not “based on” Hadoop
HDP Search
HDP 2.2 contains support for:
•  Apache Solr 4.10 with Lucene
•  Banana (Time-series visualization)
•  Lucidworks Hadoop connector
Apache Solr
What is Apache Solr?
•  A system built to search text
•  A specialized type of database management system
•  A platform to build search applications on
•  Customizable, open source software
Why Apache Solr?
Specialized tools do the job better!
•  For text search, Solr performs much better than a relational database
•  Solr knows about languages
» E.g. lowercasing ὈΔΥΣΕΎΣ produces ὀδυσεύς
•  Solr has features specific to text search
» E.g. highlighting search results
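The lowercasing example above can be tried outside Solr. The following is a minimal sketch that uses Python's built-in Unicode-aware `str.lower()` as a stand-in for a language-aware lowercase filter; it is not Solr's analysis chain. It shows that the sigma inside the word and the sigma at the end of the word get different lowercase forms.

```python
# Unicode-aware lowercasing of the Greek word from the slide.
# Python's str.lower() is only a stand-in here; it is not Solr code.
# The word is written with escapes so the exact code points are unambiguous.
word = "\u1f48\u0394\u03a5\u03a3\u0395\u038e\u03a3"  # the upper-case word from the slide

lowered = word.lower()

# The sigma in the middle of the word becomes the regular small sigma (U+03C3),
# while the sigma at the end becomes the final small sigma (U+03C2).
middle_sigma = lowered[3]
final_sigma = lowered[-1]
```

A naive byte-wise or ASCII-only lowercase would leave the word unchanged, which is why text search engines rely on Unicode- and language-aware analysis.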
Where does Apache Solr fit?
Apache Solr’s Architecture
Apache Solr’s inner Architecture
Basics of Inverted Index
Documents:
Doc ID   Content
1        I like dog
2        I like cat
3        I like dog and cat
Inverted index:
Term   Doc IDs
I      1, 2, 3
like   1, 2, 3
dog    1, 3
cat    2, 3
Question: Find documents with dog & cat
Answer: Intersect the index entries for dog and cat
(1, 3) ∩ (2, 3) = 3
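The inverted index and the intersection step can be sketched in a few lines of Python. This is an illustration of the idea only, not how Lucene stores or intersects postings. The function names are made up for this example, and unlike the slide's table the sketch also indexes the term "and".

```python
from collections import defaultdict

# The three documents from the slide, keyed by document ID.
documents = {
    1: "I like dog",
    2: "I like cat",
    3: "I like dog and cat",
}

def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, content in docs.items():
        for term in content.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search_all(index, terms):
    """Return IDs of documents containing every term (posting-list intersection)."""
    postings = [set(index.get(term.lower(), [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

index = build_inverted_index(documents)
matches = search_all(index, ["dog", "cat"])
```

Here `index["dog"]` is `[1, 3]`, `index["cat"]` is `[2, 3]`, and `matches` is `[3]`, the same result as the intersection on the slide.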
Solr Indexing
•  Define the document structure using schema.xml
•  Convert the document from its source format to a format supported by Solr (XML, JSON, CSV)
•  Add documents to Solr
Sample document in XML format:
<doc>
<field name="id">1</field>
<field name="screen_name">@thelabdude</field>
<field name="cat">post</field>
</doc>
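Converting a source record into Solr's XML format can be sketched as follows. The helper name `to_solr_add_xml` is made up for this example. Note that when a document is posted to the update handler, Solr's XML update format wraps one or more `<doc>` elements in an `<add>` element, which the sketch includes.

```python
import xml.etree.ElementTree as ET

def to_solr_add_xml(fields):
    """Serialize a dict of field name -> value into an <add><doc>...</doc></add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

# The sample document from the slide, expressed as a Python dict.
xml_message = to_solr_add_xml(
    {"id": 1, "screen_name": "@thelabdude", "cat": "post"}
)
```

Using an XML library instead of string concatenation takes care of escaping characters such as `&` and `<` in field values.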
Solr’s schema & fields
Before adding documents to Solr, you need to specify the schema in a file called schema.xml.
The schema declares:
-  Fields
-  Field used as the unique/primary key
-  Field types
-  How to index and search each field
Field Types
In Solr, every field has a type. E.g.: float, long, double, date, text
Defining a field:
<field name="id" type="text" indexed="true" stored="true" multiValued="true"/>
Dynamic Fields
Dynamic fields allow Solr to index fields that you did not explicitly define in your schema.
A dynamic field is like a regular field, except that its name contains a wildcard.
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
For more field details see: http://wiki.apache.org/solr/SchemaXml
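How a field name is resolved against explicit and dynamic definitions can be sketched like this. It is an illustration of the matching rule only, not Solr's implementation. The `explicit_fields` and `dynamic_fields` tables are a hypothetical schema, and Python's `fnmatchcase` stands in for Solr's wildcard matching.

```python
from fnmatch import fnmatchcase

# Hypothetical schema for illustration: one explicit field plus the
# dynamic field pattern from the slide.
explicit_fields = {"id": "text"}
dynamic_fields = [("*_i", "int")]

def resolve_field_type(field_name):
    """Return the type for a field: explicit definitions win, then dynamic patterns."""
    if field_name in explicit_fields:
        return explicit_fields[field_name]
    for pattern, field_type in dynamic_fields:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None  # No matching definition in this sketch.

price_type = resolve_field_type("price_i")
```

With the `*_i` pattern, a document can carry `price_i`, `count_i` or any other field ending in `_i` without a schema change.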
Query & Index Documents with Solr
Adding & Deleting From Solr
Solr offers a REST-like interface for indexing and searching.
Add to index:
curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'
Delete from index:
curl http://localhost:8983/solr/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
curl http://localhost:8983/solr/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
CSV, JSON and XML are handled directly. Solr leverages Apache Tika for complex document types (PDF, Word, etc.).
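The same add and delete requests can be prepared from a script. The sketch below uses only Python's standard library and builds the requests without sending them. The base URL matches the curl examples, while the sample book document and the function names are assumptions made for illustration.

```python
import json
import urllib.request

# Assumed base URL, matching the curl examples on the slide.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"

def build_add_request(docs):
    """Build (but do not send) a JSON add request that commits immediately."""
    return urllib.request.Request(
        SOLR_UPDATE_URL + "/json?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def build_delete_all_request():
    """Build (but do not send) an XML delete-by-query request matching all documents."""
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=b"<delete><query>*:*</query></delete>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )

# Hypothetical document; a real books.json would hold a list of such objects.
add_request = build_add_request([{"id": "book1", "name": "Example title"}])
delete_request = build_delete_all_request()
# To send against a running Solr: urllib.request.urlopen(add_request)
```

As in the curl version, the delete only becomes visible to searches after a commit.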
How to Query
Solr offers a REST-like interface for indexing and searching.
Query:
http://localhost:8983/solr/select?q=name:monsters&wt=json&indent=true
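A client typically builds the query URL from parameters and then reads the matching documents out of the JSON response. The sketch below shows both steps without contacting a server. The helper name is made up, and the sample response is hand-written in the shape of Solr's JSON response (`responseHeader`, `response.numFound`, `response.docs`) with invented document values.

```python
import json
from urllib.parse import urlencode

def build_select_url(base_url, query, **params):
    """Build a /select URL with the query and extra parameters URL-encoded."""
    all_params = {"q": query, "wt": "json", "indent": "true"}
    all_params.update(params)
    return base_url + "/select?" + urlencode(all_params)

url = build_select_url("http://localhost:8983/solr", "name:monsters")

# Hand-written sample in the shape of Solr's JSON response;
# the document values are made up for illustration.
sample_response = json.loads("""
{
  "responseHeader": {"status": 0, "QTime": 1},
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [{"id": "book1", "name": "Monsters"}]
  }
}
""")

names = [doc["name"] for doc in sample_response["response"]["docs"]]
```

`urlencode` percent-encodes the colon in `name:monsters` as `%3A`, which Solr decodes back to the same query.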
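Solr Java API (SolrJ)
import java.util.ArrayList;
import java.util.Collection;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
Index:
SolrServer server = new HttpSolrServer("http://HOST:8983/solr/");
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", "id1", 1.0f);   // field name, value, boost
doc1.addField("name", "doc1", 1.0f);
// Collect the documents to send in one request
Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc1);
server.add(docs); // You can also stream docs in a single HTTP request by providing an Iterator to add()
server.commit();
Query:
SolrQuery solrQuery = new SolrQuery().setQuery("ipod");
QueryResponse rsp = server.query(solrQuery);
// Read a stored field from each matching document
for (SolrDocument resultDoc : rsp.getResults()) {
String content = (String) resultDoc.getFieldValue("content");
}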
Solr & Hadoop
HDP Search: Deployment Options
Configuration: Solr deployed in an independent cluster
Advantages:
•  Scale independently
•  Scale easily for increased query volume
•  No need to carefully orchestrate resource allocations among workloads, indexing, and querying
Disadvantages:
•  Multiple clusters to administer and manage
Configuration: Solr index deployed on HDFS node
Advantages:
•  Single cluster to administer and manage
•  Leverages Hadoop file system advantages
Disadvantages:
•  Not supported for kerberized clusters
How to store Solr’s index on HDFS?
In the core’s solrconfig.xml, set the directory factory and the lock type:
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
<str name="solr.hdfs.home">hdfs://sandbox:8020/user/solr</str>
<bool name="solr.hdfs.blockcache.enabled">true</bool>
<int name="solr.hdfs.blockcache.slab.count">1</int>
<bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
<int name="solr.hdfs.blockcache.blocksperbank">16384</int>
<bool name="solr.hdfs.blockcache.read.enabled">true</bool>
<bool name="solr.hdfs.blockcache.write.enabled">true</bool>
<bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
<int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
<int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>
<lockType>hdfs</lockType>
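The block cache settings above determine how much memory the cache takes. A small worked calculation, assuming each cached block is 8 KiB as described in the Solr reference guide for the HDFS block cache (the block size is not stated on the slide):

```python
# Values copied from the solrconfig.xml snippet above.
slab_count = 1
blocks_per_bank = 16384

# Assumption: each cached block is 8 KiB, as described in the Solr
# reference guide for the HDFS block cache.
BLOCK_SIZE_BYTES = 8 * 1024

# One slab holds blocks_per_bank blocks; the cache holds slab_count slabs.
slab_size_bytes = blocks_per_bank * BLOCK_SIZE_BYTES
cache_size_mib = slab_count * slab_size_bytes // (1024 * 1024)
```

With these values the cache is 128 MiB. Because `solr.hdfs.blockcache.direct.memory.allocation` is `true`, that memory is taken off-heap, so the JVM's direct memory limit may need to be at least that large.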
Scalable Indexing in HDFS Using the Lucidworks Hadoop Connector
• MapReduce job
– CSV
– Microsoft Office files
– Grok (log data)
– Zip
– Solr XML
– Seq files
– WARC
• Apache Pig & Hive
– Write your own Pig/Hive scripts to index content
– Use Hive/Pig for preprocessing and joining
– Output the resulting datasets to Solr
[Diagram: raw documents in HDFS → MapReduce or Pig job → Lucene indexes in Solr]
Ingest CSV files using MapReduce
Scenario I: Ingest CSV data stored on local disk
ot@127.0.0.1:/root/csv
java -classpath "/usr/hdp/2.2.0.0-2041/hadoop-yarn/*:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/*:/opt/solr/lw/lib/*:/usr/hdp/2.2.0.0-2041/hadoop/lib/*:/opt/solr/lucidworks-hadoop-lws-job-1.3.0.jar:/usr/hdp/2.2.0.0-2041/hadoop/*" \
  com.lucidworks.hadoop.ingest.IngestJob \
  -DcsvFieldMapping=0=id,1=location,2=event_timestamp,3=deviceid,4=heartrate,5=user \
  -DcsvDelimiter="|" \
  -Dlww.commit.on.close=true \
  -cls com.lucidworks.hadoop.ingest.CSVIngestMapper \
  -c hr \
  -i ./csv \
  -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
  -s http://172.16.227.204:8983/solr
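The `csvFieldMapping` and `csvDelimiter` options describe how each CSV column becomes a named Solr field. The following sketch mimics that mapping in plain Python to make the effect visible. It is not the connector's code, the function names are made up, and the sample row is invented.

```python
def parse_field_mapping(spec):
    """Parse a '0=id,1=location' style mapping into {column_index: field_name}."""
    mapping = {}
    for pair in spec.split(","):
        index, name = pair.split("=", 1)
        mapping[int(index)] = name
    return mapping

def row_to_document(line, mapping, delimiter="|"):
    """Split one delimited line and name each column according to the mapping."""
    columns = line.rstrip("\n").split(delimiter)
    return {name: columns[i] for i, name in mapping.items() if i < len(columns)}

mapping = parse_field_mapping(
    "0=id,1=location,2=event_timestamp,3=deviceid,4=heartrate,5=user"
)
# Made-up sample row in the pipe-delimited layout the command expects.
document = row_to_document("1|Austin|2015-01-29T10:00:00Z|dev42|72|alice", mapping)
```

Each input line therefore yields one document whose field names come from the mapping, for example `heartrate` from column 4.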
Query Solr Index in Hive
Scenario II: Query index data via Hive
CREATE EXTERNAL TABLE solr (
  id STRING, location STRING, event_timestamp STRING,
  deviceid STRING, heartrate BIGINT, user STRING)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/tmp/solr'
TBLPROPERTIES (
  'solr.server.url' = 'http://172.16.227.204:8983/solr',
  'solr.collection' = 'hr',
  'solr.query' = '*:*');
SELECT user, heartrate FROM solr;
Index existing Hive data
Scenario III: Index data stored in Hive
(copy legacydata.csv to hdfs:///user/guest/legacy)
CREATE TABLE legacyhr (
  id STRING, location STRING, time STRING, hr INT, user STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
LOAD DATA INPATH '/user/guest/legacy' INTO TABLE legacyhr;
INSERT INTO TABLE solr
SELECT id, location, time, 'nodevice', hr, user FROM legacyhr;
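The INSERT relies on column position: the five legacy columns have to line up with the six columns of the Solr-backed table, with a constant filling the missing device ID. A minimal Python sketch of that alignment, using an invented sample row and a made-up function name:

```python
# Column order of the Solr-backed table from Scenario II.
SOLR_COLUMNS = ["id", "location", "event_timestamp", "deviceid", "heartrate", "user"]

def legacy_to_solr_row(legacy_row):
    """Reorder a legacyhr row (id, location, time, hr, user) to the solr table layout."""
    row_id, location, time, hr, user = legacy_row
    # Legacy data has no device ID, so a constant fills that column,
    # mirroring the 'nodevice' literal in the INSERT statement.
    return dict(zip(SOLR_COLUMNS, [row_id, location, time, "nodevice", hr, user]))

# Made-up sample row for illustration.
solr_row = legacy_to_solr_row(("17", "Austin", "2014-06-01 08:30:00", 64, "bob"))
```

Leaving out the `'nodevice'` literal would shift `hr` into the `deviceid` position and `user` into `heartrate`.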
Transform & Index documents with Pig
Scenario IV: Transform and index data stored on HDFS
data.pig
REGISTER '/opt/hadoop-lws-job-2.0.1-0-0-hadoop2.jar';
set solr.collection '$collection';
A = load '/user/guest/pigdata' using PigStorage(';') as
(id_s:chararray,location_s:chararray,event_timestamp_s:chararray,deviceid_s:chararray,heartrate_l:long,user_s:chararray);
-- ID comes first, then field name, value pairs
B = FOREACH A GENERATE $0, 'location', '29.4238889,-98.4933333', 'event_timestamp', $2,
'deviceid', $3, 'heartrate', $4, 'user', $5;
ok = store B into '$solrUrl' using com.lucidworks.hadoop.pig.SolrStoreFunc();
Run the script:
pig -p solrUrl=http://172.16.227.204:8983/solr -p collection=hr data.pig
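Relation B uses a flat layout: the document ID first, followed by alternating field names and values. The sketch below shows how such a tuple maps to a document. It illustrates the layout only and is not the SolrStoreFunc implementation. The function name and the sample record are made up.

```python
def tuple_to_document(record):
    """Turn (id, name1, value1, name2, value2, ...) into a document dict."""
    doc_id, rest = record[0], record[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected field name/value pairs after the ID")
    document = {"id": doc_id}
    # Even positions hold field names, odd positions hold the matching values.
    for name, value in zip(rest[0::2], rest[1::2]):
        document[name] = value
    return document

# Made-up record in the layout produced by relation B.
record = ("1", "location", "29.4238889,-98.4933333",
          "event_timestamp", "2015-01-29T10:00:00Z",
          "deviceid", "dev42", "heartrate", 72, "user", "alice")
document = tuple_to_document(record)
```

A tuple with a field name but no value is rejected here, which mirrors why every name in the GENERATE clause is followed by exactly one value.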
Solr Cloud
SolrCloud
Apache Solr includes fault tolerance and high availability through SolrCloud:
•  Distributed indexing
•  Distributed search
•  Central configuration for the entire cluster
•  Automatic load balancing and fail-over for queries
•  ZooKeeper integration for cluster coordination and configuration
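Distributed indexing depends on sending each document to a predictable shard. The sketch below illustrates the general idea with a stable hash of the document ID. It is a deliberate simplification: CRC32 with a modulo is a stand-in chosen for this example and is not SolrCloud's routing algorithm, and the function names are made up.

```python
import zlib

def pick_shard(doc_id, num_shards):
    """Map a document ID to a shard index with a stable hash (CRC32 stand-in)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def route(doc_ids, num_shards):
    """Group document IDs by the shard that would index them."""
    shards = {shard: [] for shard in range(num_shards)}
    for doc_id in doc_ids:
        shards[pick_shard(doc_id, num_shards)].append(doc_id)
    return shards

# Hypothetical document IDs spread over four shards.
assignment = route(["id%d" % n for n in range(100)], num_shards=4)
```

Because the hash depends only on the ID, an update or delete for a given document reaches the same shard that indexed it.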
Sizing
Sizing Guidelines (Handle With Care)
• 100-250 million docs per Solr server
• 4 Solr servers per physical machine
– Physical machine: 20 cores, 128 GB RAM
• Up to 30 queries/s with double-digit-millisecond response times
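These guidelines translate into a rough machine count. The following worked calculation uses the slide's figures with a hypothetical corpus of 2 billion documents. It ignores replication, document size and query load, so treat the result as a first estimate only.

```python
import math

# Guideline figures from the slide; treat them as rough starting points.
DOCS_PER_SERVER_LOW = 100_000_000
DOCS_PER_SERVER_HIGH = 250_000_000
SERVERS_PER_MACHINE = 4

def machines_needed(total_docs, docs_per_server):
    """Estimate physical machines needed to hold total_docs (no replication)."""
    servers = math.ceil(total_docs / docs_per_server)
    return math.ceil(servers / SERVERS_PER_MACHINE)

total_docs = 2_000_000_000  # Hypothetical corpus size.
pessimistic = machines_needed(total_docs, DOCS_PER_SERVER_LOW)
optimistic = machines_needed(total_docs, DOCS_PER_SERVER_HIGH)
```

For 2 billion documents this gives between 2 machines (at 250 million docs per server) and 5 machines (at 100 million docs per server).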
Demo

Más contenido relacionado

La actualidad más candente

Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks
 
HDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical WorkshopHDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical WorkshopHortonworks
 
Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Hortonworks
 
Hadoop crashcourse v3
Hadoop crashcourse v3Hadoop crashcourse v3
Hadoop crashcourse v3Hortonworks
 
Protecting enterprise Data in Hadoop
Protecting enterprise Data in HadoopProtecting enterprise Data in Hadoop
Protecting enterprise Data in HadoopDataWorks Summit
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with sparkHortonworks
 
Double Your Hadoop Hardware Performance with SmartSense
Double Your Hadoop Hardware Performance with SmartSenseDouble Your Hadoop Hardware Performance with SmartSense
Double Your Hadoop Hardware Performance with SmartSenseHortonworks
 
Apache Ambari: Managing Hadoop and YARN
Apache Ambari: Managing Hadoop and YARNApache Ambari: Managing Hadoop and YARN
Apache Ambari: Managing Hadoop and YARNHortonworks
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache RangerDataWorks Summit
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark Hortonworks
 
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in HadoopDiscover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in HadoopHortonworks
 
Hadoop and Spark – Perfect Together
Hadoop and Spark – Perfect TogetherHadoop and Spark – Perfect Together
Hadoop and Spark – Perfect TogetherHortonworks
 
Hive - 1455: Cloud Storage
Hive - 1455: Cloud StorageHive - 1455: Cloud Storage
Hive - 1455: Cloud StorageHortonworks
 
Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)
Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)
Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)Cedric CARBONE
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics OptimizationHortonworks
 
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...DataWorks Summit
 
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...Hortonworks
 
Deploying Docker applications on YARN via Slider
Deploying Docker applications on YARN via SliderDeploying Docker applications on YARN via Slider
Deploying Docker applications on YARN via SliderHortonworks
 
Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN ApplicationsHortonworks
 

La actualidad más candente (20)

The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices Workshop
 
HDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical WorkshopHDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical Workshop
 
Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture
 
Hadoop crashcourse v3
Hadoop crashcourse v3Hadoop crashcourse v3
Hadoop crashcourse v3
 
Protecting enterprise Data in Hadoop
Protecting enterprise Data in HadoopProtecting enterprise Data in Hadoop
Protecting enterprise Data in Hadoop
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
 
Double Your Hadoop Hardware Performance with SmartSense
Double Your Hadoop Hardware Performance with SmartSenseDouble Your Hadoop Hardware Performance with SmartSense
Double Your Hadoop Hardware Performance with SmartSense
 
Apache Ambari: Managing Hadoop and YARN
Apache Ambari: Managing Hadoop and YARNApache Ambari: Managing Hadoop and YARN
Apache Ambari: Managing Hadoop and YARN
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in HadoopDiscover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
 
Hadoop and Spark – Perfect Together
Hadoop and Spark – Perfect TogetherHadoop and Spark – Perfect Together
Hadoop and Spark – Perfect Together
 
Hive - 1455: Cloud Storage
Hive - 1455: Cloud StorageHive - 1455: Cloud Storage
Hive - 1455: Cloud Storage
 
Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)
Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)
Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics Optimization
 
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
 
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
 
Deploying Docker applications on YARN via Slider
Deploying Docker applications on YARN via SliderDeploying Docker applications on YARN via Slider
Deploying Docker applications on YARN via Slider
 
Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN Applications
 

Destacado

Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideDouglas Bernardini
 
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...Hortonworks
 
Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxAlex Moundalexis
 
Discover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop SearchDiscover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop SearchHortonworks
 
TestDFSIO
TestDFSIOTestDFSIO
TestDFSIOhhyin
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitSaptak Sen
 
Create a Smarter Data Lake with HP Haven and Apache Hadoop
Create a Smarter Data Lake with HP Haven and Apache HadoopCreate a Smarter Data Lake with HP Haven and Apache Hadoop
Create a Smarter Data Lake with HP Haven and Apache HadoopHortonworks
 
Hortonworks Sandbox Startup Guide for VirtualBox
Hortonworks Sandbox Startup Guide for VirtualBoxHortonworks Sandbox Startup Guide for VirtualBox
Hortonworks Sandbox Startup Guide for VirtualBoxHortonworks
 
Big Data Benchmarking Tutorial
Big Data Benchmarking TutorialBig Data Benchmarking Tutorial
Big Data Benchmarking TutorialTilmann Rabl
 
Data Discovery, Visualization, and Apache Hadoop
Data Discovery, Visualization, and Apache HadoopData Discovery, Visualization, and Apache Hadoop
Data Discovery, Visualization, and Apache HadoopHortonworks
 
Hortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics ApplicationsHortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics Applicationsrussell_jurney
 
The Benefits of Predictive and Proactive Support for an Enterprise Data Hub
The Benefits of Predictive and Proactive Support for an Enterprise Data HubThe Benefits of Predictive and Proactive Support for an Enterprise Data Hub
The Benefits of Predictive and Proactive Support for an Enterprise Data HubCloudera, Inc.
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationAlex Moundalexis
 
Cloudera security and enterprise license by Athemaster(繁中)
Cloudera security and enterprise license by Athemaster(繁中)Cloudera security and enterprise license by Athemaster(繁中)
Cloudera security and enterprise license by Athemaster(繁中)Athemaster Co., Ltd.
 
HPE and Hortonworks join forces to Deliver Healthcare Transformation
HPE and Hortonworks join forces to Deliver Healthcare TransformationHPE and Hortonworks join forces to Deliver Healthcare Transformation
HPE and Hortonworks join forces to Deliver Healthcare TransformationHortonworks
 
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Hortonworks
 
Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2Sematext Group, Inc.
 
Hortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Hortonworks
 

Destacado (20)

Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config Guide
 
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
 
Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via Linux
 
Discover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop SearchDiscover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop Search
 
Hadoop Summit Tokyo HDP Sandbox Workshop
Hadoop Summit Tokyo HDP Sandbox Workshop Hadoop Summit Tokyo HDP Sandbox Workshop
Hadoop Summit Tokyo HDP Sandbox Workshop
 
TestDFSIO
TestDFSIOTestDFSIO
TestDFSIO
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
 
Create a Smarter Data Lake with HP Haven and Apache Hadoop
Create a Smarter Data Lake with HP Haven and Apache HadoopCreate a Smarter Data Lake with HP Haven and Apache Hadoop
Create a Smarter Data Lake with HP Haven and Apache Hadoop
 
Hortonworks Sandbox Startup Guide for VirtualBox
Hortonworks Sandbox Startup Guide for VirtualBoxHortonworks Sandbox Startup Guide for VirtualBox
Hortonworks Sandbox Startup Guide for VirtualBox
 
Big Data Benchmarking Tutorial
Big Data Benchmarking TutorialBig Data Benchmarking Tutorial
Big Data Benchmarking Tutorial
 
Data Discovery, Visualization, and Apache Hadoop
Data Discovery, Visualization, and Apache HadoopData Discovery, Visualization, and Apache Hadoop
Data Discovery, Visualization, and Apache Hadoop
 
Hortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics ApplicationsHortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics Applications
 
The Benefits of Predictive and Proactive Support for an Enterprise Data Hub
The Benefits of Predictive and Proactive Support for an Enterprise Data HubThe Benefits of Predictive and Proactive Support for an Enterprise Data Hub
The Benefits of Predictive and Proactive Support for an Enterprise Data Hub
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux Configuration
 
Cloudera security and enterprise license by Athemaster(繁中)
Cloudera security and enterprise license by Athemaster(繁中)Cloudera security and enterprise license by Athemaster(繁中)
Cloudera security and enterprise license by Athemaster(繁中)
 
HPE and Hortonworks join forces to Deliver Healthcare Transformation
HPE and Hortonworks join forces to Deliver Healthcare TransformationHPE and Hortonworks join forces to Deliver Healthcare Transformation
HPE and Hortonworks join forces to Deliver Healthcare Transformation
 
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
 
Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2
 
Hortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical Applications
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
 

Similar a Hortonworks Technical Workshop - HDP Search

HDP Search Overview (APACHE SOLR & HADOOP)
HDP Search Overview (APACHE SOLR & HADOOP)HDP Search Overview (APACHE SOLR & HADOOP)
HDP Search Overview (APACHE SOLR & HADOOP)Chris Casano
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development TutorialErik Hatcher
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes WorkshopErik Hatcher
 
Building Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFiBuilding Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFiBryan Bende
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
Solr Powered Lucene
Solr Powered LuceneSolr Powered Lucene
Solr Powered LuceneErik Hatcher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchBigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchNetConstructor, Inc.
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchRafał Kuć
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARNDataWorks Summit
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitDataWorks Summit
 
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDriving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDataWorks Summit
 

Hortonworks Technical Workshop - HDP Search

  • 1. Page 1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved HDP Search Workshop Hortonworks. We do Hadoop. 1/29/2013
  • 2. Agenda
      •  Hortonworks Data Platform 2.2
      •  Apache Solr
      •  Query & Ingest Documents with Apache Solr
      •  Solr & Hadoop
      •  Index on HDFS
      •  MapReduce, Hive & Pig
      •  Solr Cloud
      •  Sizing
      •  Demo
  • 3. HDP 2.2
  • 4. HDP delivers a comprehensive data management platform
      [Architecture diagram: Hortonworks Data Platform 2.2]
      •  Core: HDFS (Hadoop Distributed File System) and YARN: Data Operating System (Cluster Resource Management)
      •  Batch, interactive & real-time data access: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines); Tez and Slider appear as execution layers
      •  Governance: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS)
      •  Security: Authentication, Authorization, Accounting, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger)
      •  Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
      •  Deployment choice: Linux, Windows, On-Premises, Cloud
      Callouts:
      •  YARN is the architectural center of HDP
      •  Enables batch, interactive and real-time workloads
      •  Provides comprehensive enterprise capabilities
      •  The widest range of deployment options
      •  Delivered completely in the open
  • 5. HDP 2.2: Reliable, Consistent & Current. HDP is Apache Hadoop, not “based on” Hadoop
  • 6. HDP Search
      HDP 2.2 contains support for:
      •  Apache Solr 4.10 with Lucene
      •  Banana (time-series visualization)
      •  Lucidworks Hadoop connector
  • 7. Apache Solr
  • 8. What is Apache Solr
      •  A system built to search text
      •  A specialized type of database management system
      •  A platform to build search applications on
      •  Customizable, open source software
  • 9. Why Apache Solr
      Specialized tools do the job better!
      •  Solr performs much better than a relational database for text search
      •  Solr knows about languages
         » E.g. lowercasing ὈΔΥΣΕΎΣ produces ὀδυσεύς
      •  Solr has features specific to text search
         » E.g. highlighting search results
  • 10. Where does Apache Solr fit?
  • 11. Apache Solr’s Architecture
  • 12. Apache Solr’s inner Architecture
  • 13. Basics of Inverted Index
      Documents:
        Doc ID   Content
        1        I like dog
        2        I like cat
        3        I like dog and cat
      Inverted index:
        Term   Doc IDs
        I      1,2,3
        like   1,2,3
        dog    1,3
        cat    2,3
      Question: find documents with dog & cat.
      Answer: intersect the index entries for dog and cat: (1,3) ∩ (2,3) = 3
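
The inverted-index lookup on this slide can be sketched in a few lines of plain Java. This is only an illustration of the idea, not how Lucene stores its index: the class and method names are made up for this example, tokenization is a simple whitespace split, and every token is indexed (including "and", which the slide's table leaves out).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {

    // Maps each lowercased term to the sorted set of document IDs that contain it.
    static Map<String, SortedSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            for (String term : doc.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    // AND query: intersect the posting lists of all requested terms.
    static SortedSet<Integer> searchAll(Map<String, SortedSet<Integer>> index, String... terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> postings =
                    index.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "I like dog");
        docs.put(2, "I like cat");
        docs.put(3, "I like dog and cat");

        Map<String, SortedSet<Integer>> index = build(docs);
        System.out.println("dog -> " + index.get("dog"));                      // dog -> [1, 3]
        System.out.println("cat -> " + index.get("cat"));                      // cat -> [2, 3]
        System.out.println("dog AND cat -> " + searchAll(index, "Dog", "Cat")); // dog AND cat -> [3]
    }
}
```

Keeping the posting lists sorted is what makes the intersection cheap; real engines rely on the same property at much larger scale.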
  • 14. Solr Indexing
      •  Define the document structure using schema.xml
      •  Convert the document from its source format to a format supported by Solr (XML, JSON, CSV)
      •  Add documents to Solr
      Sample document in XML format:
        <doc>
          <field name="id">1</field>
          <field name="screen_name">@thelabdude</field>
          <field name="cat">post</field>
        </doc>
  • 15. Solr’s schema & fields
      Before adding documents to Solr, you need to specify the schema, represented in a file called schema.xml. The schema declares:
      -  Fields
      -  Field used as the unique/primary key
      -  Field types
      -  How to index and search each field
      Field types: in Solr, every field has a type, e.g. float, long, double, date, text.
      Defining a field:
        <field name="id" type="text" indexed="true" stored="true" multiValued="true"/>
  • 16. Dynamic Fields
      Dynamic fields allow Solr to index fields that you did not explicitly define in your schema. A dynamic field is like a regular field, except that its name contains a wildcard.
        <dynamicField name="*_i" type="int" indexed="true" stored="true"/>
      For more field details see: http://wiki.apache.org/solr/SchemaXml
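
To make the wildcard idea concrete, here is a simplified model of how a field name could be resolved against explicit and dynamic field definitions. It is a sketch under assumptions, not Solr's actual resolution code: the class and method names are invented, explicit fields are assumed to win, and the first matching dynamic pattern in declaration order is used.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class DynamicFieldSketch {

    // True when fieldName matches a pattern that has a single leading or trailing '*'.
    static boolean matches(String pattern, String fieldName) {
        if (pattern.startsWith("*")) {
            return fieldName.endsWith(pattern.substring(1));
        }
        if (pattern.endsWith("*")) {
            return fieldName.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(fieldName);
    }

    // Assumption: explicit fields take precedence, then the first matching dynamic pattern.
    static Optional<String> resolveType(Map<String, String> explicitFields,
                                        Map<String, String> dynamicFields,
                                        String fieldName) {
        if (explicitFields.containsKey(fieldName)) {
            return Optional.of(explicitFields.get(fieldName));
        }
        for (Map.Entry<String, String> dynamic : dynamicFields.entrySet()) {
            if (matches(dynamic.getKey(), fieldName)) {
                return Optional.of(dynamic.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> explicitFields = new LinkedHashMap<>();
        explicitFields.put("id", "text");    // the field defined on the schema slide
        Map<String, String> dynamicFields = new LinkedHashMap<>();
        dynamicFields.put("*_i", "int");     // the dynamicField shown above

        System.out.println(resolveType(explicitFields, dynamicFields, "id"));          // Optional[text]
        System.out.println(resolveType(explicitFields, dynamicFields, "heartrate_i")); // Optional[int]
        System.out.println(resolveType(explicitFields, dynamicFields, "comment"));     // Optional.empty
    }
}
```

The field name `heartrate_i` is a hypothetical example; any name ending in `_i` would resolve the same way in this model.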
  • 17. Query & Index Documents with Solr
  • 18. Adding & Deleting From Solr
      Solr offers a REST-like interface for indexing and searching.
      Add to index:
        curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'
      Delete from index:
        curl http://localhost:8983/solr/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
        curl http://localhost:8983/solr/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
      CSV, JSON and XML are handled directly. Solr leverages Apache Tika for complex document types (PDF, Word, etc.).
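
The same update calls can be prepared from Java with the JDK's built-in `java.net.http` API (Java 11 or later). The sketch below only builds the requests and prints them, so it runs without a Solr server. The class and method names and the sample JSON document are hypothetical; the endpoints and headers mirror the curl commands on the slide.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class SolrUpdateRequestSketch {

    // Mirrors: curl '.../update/json?commit=true' --data-binary @books.json
    static HttpRequest addJson(String baseUrl, String jsonDocs) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/update/json?commit=true"))
                .header("Content-type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonDocs))
                .build();
    }

    // Shared helper for the two XML commands (delete and commit).
    static HttpRequest xmlCommand(String baseUrl, String xmlBody) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/update"))
                .header("Content-type", "text/xml; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(xmlBody))
                .build();
    }

    // Deletes every document; becomes visible after a commit.
    static HttpRequest deleteAll(String baseUrl) {
        return xmlCommand(baseUrl, "<delete><query>*:*</query></delete>");
    }

    static HttpRequest commit(String baseUrl) {
        return xmlCommand(baseUrl, "<commit/>");
    }

    public static void main(String[] args) {
        String baseUrl = "http://localhost:8983/solr";
        // Hypothetical document standing in for the contents of books.json.
        HttpRequest add = addJson(baseUrl, "[{\"id\":\"book-1\",\"name\":\"Monsters\"}]");
        System.out.println(add.method() + " " + add.uri());
        System.out.println(deleteAll(baseUrl).method() + " " + deleteAll(baseUrl).uri());
        System.out.println(commit(baseUrl).method() + " " + commit(baseUrl).uri());
        // To actually send a request against a running Solr instance:
        // HttpClient.newHttpClient().send(add, HttpResponse.BodyHandlers.ofString());
    }
}
```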
  • 19. How To Query
      Solr offers a REST-like interface for indexing and searching.
      Query:
        http://localhost:8983/solr/select?q=name:monsters&wt=json&indent=true
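
When the query string comes from user input, it is worth URL-encoding each parameter rather than concatenating raw text. A minimal sketch, assuming Java 10 or later for `URLEncoder.encode(String, Charset)`; the class and method names are illustrative only, while the parameters `q`, `wt` and `indent` are the ones shown on the slide.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class SolrQueryUrlSketch {

    // Builds <baseUrl>/select?<k1>=<v1>&<k2>=<v2>... with form-style URL encoding.
    static String selectUrl(String baseUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> param : params.entrySet()) {
            query.add(URLEncoder.encode(param.getKey(), StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(param.getValue(), StandardCharsets.UTF_8));
        }
        return baseUrl + "/select?" + query;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "name:monsters");
        params.put("wt", "json");
        params.put("indent", "true");
        // Prints: http://localhost:8983/solr/select?q=name%3Amonsters&wt=json&indent=true
        System.out.println(selectUrl("http://localhost:8983/solr", params));
    }
}
```

The colon is percent-encoded as `%3A`, which is equivalent to the unencoded form shown on the slide once the server decodes the request.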
  • 20. Solr Java API (SolrJ)
      Index:
        SolrServer server = new HttpSolrServer("http://HOST:8983/solr/");
        SolrInputDocument doc1 = new SolrInputDocument();
        doc1.addField( "id", "id1", 1.0f );
        doc1.addField( "name", "doc1", 1.0f );
        server.add( doc1 ); // You can also stream docs in a single HTTP request by providing an Iterator to add()
        server.commit();
      Query:
        SolrQuery solrQuery = new SolrQuery().setQuery("ipod");
        QueryResponse rsp = server.query(solrQuery);
        Iterator<SolrDocument> iter = rsp.getResults().iterator();
        while (iter.hasNext()) {
          SolrDocument resultDoc = iter.next();
          String content = (String) resultDoc.getFieldValue("content");
        }
  • 21. Solr & Hadoop
  • 22. HDP Search: Deployment Options
      Configuration: Solr deployed in an independent cluster
        Advantages:
          •  Scale independently
          •  Scale easily for increased query volume
          •  No need to carefully orchestrate resource allocations among workloads, indexing, and querying
        Disadvantages:
          •  Multiple clusters to administer and manage
      Configuration: Solr index deployed on HDFS node
        Advantages:
          •  Single cluster to administer and manage
          •  Leverages Hadoop file system advantages
        Disadvantages:
          •  Not supported for kerberized cluster
  • 23. How to store Solr’s index on HDFS?
      Update the core’s solrconfig.xml to set the directory factory and the lock type:
        <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
          <str name="solr.hdfs.home">hdfs://sandbox:8020/user/solr</str>
          <bool name="solr.hdfs.blockcache.enabled">true</bool>
          <int name="solr.hdfs.blockcache.slab.count">1</int>
          <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
          <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
          <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
          <bool name="solr.hdfs.blockcache.write.enabled">true</bool>
          <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
          <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
          <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
        </directoryFactory>
        <lockType>hdfs</lockType>
  • 24. Scalable Indexing in HDFS Using the Lucidworks Hadoop Connector
      • MapReduce job
        – CSV
        – Microsoft Office files
        – Grok (log data)
        – Zip
        – Solr XML
        – Seq files
        – WARC
      • Apache Pig & Hive
        – Write your own Pig/Hive scripts to index content
        – Use Hive/Pig for preprocessing and joining
        – Output the resulting datasets to Solr
      Diagram labels: HDFS (raw documents), MapReduce or Pig job, Solr (Lucene indexes)
  • 25. Ingest CSV files using MapReduce
      Scenario I: Ingest CSV data stored on local disk
        ot@127.0.0.1:/root/csv
        java -classpath "/usr/hdp/2.2.0.0-2041/hadoop-yarn/*:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/*:/opt/solr/lw/lib/*:/usr/hdp/2.2.0.0-2041/hadoop/lib/*:/opt/solr/lucidworks-hadoop-lws-job-1.3.0.jar:/usr/hdp/2.2.0.0-2041/hadoop/*" \
          com.lucidworks.hadoop.ingest.IngestJob \
          -DcsvFieldMapping=0=id,1=location,2=event_timestamp,3=deviceid,4=heartrate,5=user \
          -DcsvDelimiter="|" \
          -Dlww.commit.on.close=true \
          -cls com.lucidworks.hadoop.ingest.CSVIngestMapper \
          -c hr -i ./csv \
          -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
          -s http://172.16.227.204:8983/solr
  • 26. Query Solr Index in Hive
      Scenario II: Query index data via Hive
        CREATE EXTERNAL TABLE solr (id string, location string, event_timestamp String, deviceid String, heartrate BigInt, user String)
          STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
          LOCATION '/tmp/solr'
          TBLPROPERTIES('solr.server.url' = 'http://172.16.227.204:8983/solr', 'solr.collection' = 'hr', 'solr.query' = '*:*');
        SELECT user, heartrate FROM solr;
  • 27. Index existing Hive data
      Scenario III: Index data stored in Hive (copy legacydata.csv to hdfs:///user/guest/legacy)
        CREATE TABLE legacyhr (id string, location string, time string, hr int, user string)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
        LOAD DATA INPATH '/user/guest/legacy' INTO TABLE legacyhr;
        INSERT INTO TABLE solr SELECT id, location, time, 'nodevice', hr, user FROM legacyhr;
  • 28. Transform & Index documents with Pig
      Scenario IV: Transform and index data stored on HDFS
      data.pig:
        REGISTER '/opt/hadoop-lws-job-2.0.1-0-0-hadoop2.jar';
        set solr.collection '$collection';
        A = load '/user/guest/pigdata' using PigStorage(';') as (id_s:chararray,location_s:chararray,event_timestamp_s:chararray,deviceid_s:chararray,heartrate_l:long,user_s:chararray);
        -- ID comes first, then field name, value
        B = FOREACH A GENERATE $0, 'location', '29.4238889,-98.4933333', 'event_timestamp', $2, 'deviceid', $3, 'heartrate', $4, 'user', $5;
        ok = store B into '$solrUrl' using com.lucidworks.hadoop.pig.SolrStoreFunc();
      Run:
        pig -p solrUrl=http://172.16.227.204:8983/solr -p collection=hr data.pig
  • 29. Solr Cloud
  • 30. SolrCloud
      Apache Solr includes fault tolerance and high availability through SolrCloud:
      •  Distributed indexing
      •  Distributed search
      •  Central configuration for the entire cluster
      •  Automatic load balancing and fail-over for queries
      •  ZooKeeper integration for cluster coordination and configuration
  • 31. Sizing
  • 32. Sizing Guidelines (Handle With Care)
      • 100-250 million docs per Solr server
      • 4 Solr servers per physical machine
        – Physical machine: 20 cores, 128 GB RAM
      • Up to 30 queries/s with double-digit millisecond response times
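
As a rough worked example of these guidelines, the arithmetic below estimates how many Solr servers and physical machines a given document count would need. It is a back-of-the-envelope sketch only: the class and method names are invented, the 1 billion document corpus is a hypothetical input, and replication, document size and query load are not modeled.

```java
public class SolrSizingSketch {

    // Guideline values taken from the slide.
    static final long MIN_DOCS_PER_SERVER = 100_000_000L;
    static final long MAX_DOCS_PER_SERVER = 250_000_000L;
    static final int SERVERS_PER_MACHINE = 4;

    // Integer division rounded up, for positive inputs.
    static long ceilDiv(long dividend, long divisor) {
        return (dividend + divisor - 1) / divisor;
    }

    static long serversNeeded(long totalDocs, long docsPerServer) {
        return ceilDiv(totalDocs, docsPerServer);
    }

    static long machinesNeeded(long solrServers) {
        return ceilDiv(solrServers, SERVERS_PER_MACHINE);
    }

    public static void main(String[] args) {
        long totalDocs = 1_000_000_000L; // hypothetical corpus size

        long optimisticServers = serversNeeded(totalDocs, MAX_DOCS_PER_SERVER);
        long conservativeServers = serversNeeded(totalDocs, MIN_DOCS_PER_SERVER);

        // Prints: 250M docs/server: 4 servers on 1 machine(s)
        System.out.println("250M docs/server: " + optimisticServers + " servers on "
                + machinesNeeded(optimisticServers) + " machine(s)");
        // Prints: 100M docs/server: 10 servers on 3 machine(s)
        System.out.println("100M docs/server: " + conservativeServers + " servers on "
                + machinesNeeded(conservativeServers) + " machine(s)");
    }
}
```

The spread between the two results is a reminder of why the slide says to handle these numbers with care: the estimate is only a starting point for testing with real data.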
  • 33. Demo