SlideShare una empresa de Scribd logo
1 de 16
Web Scraping Using Nutch and Solr
● A simple example of using open source code
● Web Scrape a single web site - ours
● Environment and code
– Using Centos V6.2 ( Linux )
– Apache Nutch 1.6
– Solr 4.2.1
– Java 1.6
Nutch and Solr Architecture
● Nutch processes urls and feeds content to Solr
● Solr indexes content
Where to get source code
● Nutch
– http://nutch.apache.org
● Solr
– http://lucene.apache.org/solr
● Java
– http://java.com
Installing Source - Nutch
● Nutch is delivered as
– apache-nutch-1.6-bin.tar ( 64M )
– apache-nutch-1.6-src.tar ( 20M )
● Copy each tar file to your desired location
● Install each tar file as
– tar xvf <tar file>
● Second tar file optional
Installing Source - Solr
● Solr is delivered as
– solr-4.2.1.zip ( 116M )
● Copy file to your desired location
● Install each tar file as
– unzip <zip file>
Configuring Nutch Part 1
● Assuming we will crawl a single web site
● Ensure that JAVA_HOME is set
● cd apache-nutch-1.6
● Edit agent name in conf/nutch-site.xml
<property>
<name>http.agent.name</name>
<value>Nutch Spider</value>
</property>
● mkdir -p urls ; cd urls ; touch seed.txt
Configuring Nutch Part 2
● Add following url ( ours ) to seed.txt
– http://www.semtech-solutions.co.nz
● Change url filtering in conf/regex-urlfilter.txt, change the line
– # accept anything else
– +.
– To be
– +^http://([a-z0-9]*.)*semtech-solutions.co.nz/
● This means that we will filter the urls found to only be from the
local site
Configuring Solr Part 1
● cd solr-4.2.1/example/solr/collection1/conf
● Add some extra fields to schema.xml after _version_ field i.e.
Start Solr Server – Part 1
● Within solr-4.2.1/example
● Run the following command
● java -jar start.jar
● Now try to access admin web page for solr
– http://localhost:8983/solr/admin
● You should now see the admin web site
– ( see next page )
Start Solr Server – Part 2
● Solr Admin web page
Run Nutch / Solr
● We are ready to crawl our first web site
● Go to apache-nutch-1.6 directory
● Run the following commands
– touch nutch_start.bash
– chmod 755 nutch_start.bash
– vi nutch_start.bash
● Add the text to the file
#!/bin/bash
bin/nutch crawl urls -solr http://localhost:8983/solr/ 
-dir crawl -depth 3 -topN 3
Run Nutch / Solr
● Now run the nutch bash file
– ./nutch_start.bash
● Select the Logging option on the admin console
● Monitor for errors in Logging console
● The crawl should finish with no errors and the line
– Crawl finished: crawl
– In the crawl window
Check Crawled Data
● Now we check the data that we have crawled
● In Admin Console window
– Set Core Selector to collection1
– Select the Query option
– Click execute query button
● You should now see some of the data that you have crawled
Crawled Data
● Crawled data in solr query
Crawled Data
● Thats your first simple crawl completed
● Further reading at
– http://nutch.apache.org
– http://lucene.apache.org/solr
● Now you can
– Add more urls to your seed.txt
– Increase the depth of your link search via options
● -depth
● -topN
– Modify your url filtering
Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You can just pay for those hours that you need
● To solve your problems

Más contenido relacionado

La actualidad más candente

Nutch as a Web data mining platform
Nutch as a Web data mining platformNutch as a Web data mining platform
Nutch as a Web data mining platformabial
 
Hadoop 2.x HDFS Cluster Installation (VirtualBox)
Hadoop 2.x  HDFS Cluster Installation (VirtualBox)Hadoop 2.x  HDFS Cluster Installation (VirtualBox)
Hadoop 2.x HDFS Cluster Installation (VirtualBox)Amir Sedighi
 
Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutchsebastian_nagel
 
HBaseConEast2016: HBase on Docker with Clusterdock
HBaseConEast2016: HBase on Docker with ClusterdockHBaseConEast2016: HBase on Docker with Clusterdock
HBaseConEast2016: HBase on Docker with ClusterdockMichael Stack
 
Hadoop Installation and basic configuration
Hadoop Installation and basic configurationHadoop Installation and basic configuration
Hadoop Installation and basic configurationGerrit van Vuuren
 
Nutch - web-scale search engine toolkit
Nutch - web-scale search engine toolkitNutch - web-scale search engine toolkit
Nutch - web-scale search engine toolkitabial
 
Distributed Data Processing Workshop - SBU
Distributed Data Processing Workshop - SBUDistributed Data Processing Workshop - SBU
Distributed Data Processing Workshop - SBUAmir Sedighi
 
Apache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentApache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentFarzad Nozarian
 
Install hadoop in a cluster
Install hadoop in a clusterInstall hadoop in a cluster
Install hadoop in a clusterXuhong Zhang
 
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Apache HBase - Lab Assignment
Apache HBase - Lab AssignmentApache HBase - Lab Assignment
Apache HBase - Lab AssignmentFarzad Nozarian
 
Introduction to Mesos
Introduction to MesosIntroduction to Mesos
Introduction to Mesoskoboltmarky
 
Elasticsearch 1.x Cluster Installation (VirtualBox)
Elasticsearch 1.x Cluster Installation (VirtualBox)Elasticsearch 1.x Cluster Installation (VirtualBox)
Elasticsearch 1.x Cluster Installation (VirtualBox)Amir Sedighi
 
Advanced troubleshooting linux performance
Advanced troubleshooting linux performanceAdvanced troubleshooting linux performance
Advanced troubleshooting linux performanceForthscale
 
D8 configuration migration
D8 configuration migrationD8 configuration migration
D8 configuration migrationViktor Likin
 

La actualidad más candente (20)

Caching. api. http 1.1
Caching. api. http 1.1Caching. api. http 1.1
Caching. api. http 1.1
 
Nutch as a Web data mining platform
Nutch as a Web data mining platformNutch as a Web data mining platform
Nutch as a Web data mining platform
 
Hadoop 2.x HDFS Cluster Installation (VirtualBox)
Hadoop 2.x  HDFS Cluster Installation (VirtualBox)Hadoop 2.x  HDFS Cluster Installation (VirtualBox)
Hadoop 2.x HDFS Cluster Installation (VirtualBox)
 
Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutch
 
HBaseConEast2016: HBase on Docker with Clusterdock
HBaseConEast2016: HBase on Docker with ClusterdockHBaseConEast2016: HBase on Docker with Clusterdock
HBaseConEast2016: HBase on Docker with Clusterdock
 
Hadoop Installation and basic configuration
Hadoop Installation and basic configurationHadoop Installation and basic configuration
Hadoop Installation and basic configuration
 
Nutch - web-scale search engine toolkit
Nutch - web-scale search engine toolkitNutch - web-scale search engine toolkit
Nutch - web-scale search engine toolkit
 
Distributed Data Processing Workshop - SBU
Distributed Data Processing Workshop - SBUDistributed Data Processing Workshop - SBU
Distributed Data Processing Workshop - SBU
 
Perl Programming - 04 Programming Database
Perl Programming - 04 Programming DatabasePerl Programming - 04 Programming Database
Perl Programming - 04 Programming Database
 
Apache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentApache HDFS - Lab Assignment
Apache HDFS - Lab Assignment
 
Shark - Lab Assignment
Shark - Lab AssignmentShark - Lab Assignment
Shark - Lab Assignment
 
Chef
ChefChef
Chef
 
Install hadoop in a cluster
Install hadoop in a clusterInstall hadoop in a cluster
Install hadoop in a cluster
 
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Apache HBase - Lab Assignment
Apache HBase - Lab AssignmentApache HBase - Lab Assignment
Apache HBase - Lab Assignment
 
Introduction to Mesos
Introduction to MesosIntroduction to Mesos
Introduction to Mesos
 
Elasticsearch 1.x Cluster Installation (VirtualBox)
Elasticsearch 1.x Cluster Installation (VirtualBox)Elasticsearch 1.x Cluster Installation (VirtualBox)
Elasticsearch 1.x Cluster Installation (VirtualBox)
 
Advanced troubleshooting linux performance
Advanced troubleshooting linux performanceAdvanced troubleshooting linux performance
Advanced troubleshooting linux performance
 
D8 configuration migration
D8 configuration migrationD8 configuration migration
D8 configuration migration
 

Destacado

Migration from FAST ESP to Solr
Migration from FAST ESP to SolrMigration from FAST ESP to Solr
Migration from FAST ESP to SolrTNR Global
 
Solr installation
Solr installationSolr installation
Solr installationZHAO Sam
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...rhatr
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopGrant Ingersoll
 
Frontera-Open Source Large Scale Web Crawling Framework
Frontera-Open Source Large Scale Web Crawling FrameworkFrontera-Open Source Large Scale Web Crawling Framework
Frontera-Open Source Large Scale Web Crawling Frameworksixtyone
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopNikolai Avteniev
 
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...lucenerevolution
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache NutchJulien Nioche
 
Apache Solr-Webinar
Apache Solr-WebinarApache Solr-Webinar
Apache Solr-WebinarEdureka!
 
Introduction to Apache Solr.
Introduction to Apache Solr.Introduction to Apache Solr.
Introduction to Apache Solr.ashish0x90
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkScrapinghub
 
Solr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchSolr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchCloudera, Inc.
 

Destacado (14)

Migration from FAST ESP to Solr
Migration from FAST ESP to SolrMigration from FAST ESP to Solr
Migration from FAST ESP to Solr
 
Solr installation
Solr installationSolr installation
Solr installation
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
 
Building a Search Engine Using Lucene
Building a Search Engine Using LuceneBuilding a Search Engine Using Lucene
Building a Search Engine Using Lucene
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
 
Apache ManifoldCF
Apache ManifoldCFApache ManifoldCF
Apache ManifoldCF
 
Frontera-Open Source Large Scale Web Crawling Framework
Frontera-Open Source Large Scale Web Crawling FrameworkFrontera-Open Source Large Scale Web Crawling Framework
Frontera-Open Source Large Scale Web Crawling Framework
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On Hadoop
 
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 
Apache Solr-Webinar
Apache Solr-WebinarApache Solr-Webinar
Apache Solr-Webinar
 
Introduction to Apache Solr.
Introduction to Apache Solr.Introduction to Apache Solr.
Introduction to Apache Solr.
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling framework
 
Solr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchSolr+Hadoop = Big Data Search
Solr+Hadoop = Big Data Search
 

Similar a Web scraping with nutch solr

Nginx as a Revers Proxy for Apache on Ubuntu
Nginx as a Revers Proxy for Apache on UbuntuNginx as a Revers Proxy for Apache on Ubuntu
Nginx as a Revers Proxy for Apache on Ubuntuabdullah roomi
 
OpenNebula 5.4 Hands-on Tutorial
OpenNebula 5.4 Hands-on TutorialOpenNebula 5.4 Hands-on Tutorial
OpenNebula 5.4 Hands-on TutorialOpenNebula Project
 
Android for Embedded Linux Developers
Android for Embedded Linux DevelopersAndroid for Embedded Linux Developers
Android for Embedded Linux DevelopersOpersys inc.
 
Setting up LAMP for Linux newbies
Setting up LAMP for Linux newbiesSetting up LAMP for Linux newbies
Setting up LAMP for Linux newbiesShabir Ahmad
 
Apache Performance Tuning: Scaling Up
Apache Performance Tuning: Scaling UpApache Performance Tuning: Scaling Up
Apache Performance Tuning: Scaling UpSander Temme
 
Cobbler, Func and Puppet: Tools for Large Scale Environments
Cobbler, Func and Puppet: Tools for Large Scale EnvironmentsCobbler, Func and Puppet: Tools for Large Scale Environments
Cobbler, Func and Puppet: Tools for Large Scale EnvironmentsMichael Zhang
 
MongoDB - Sharded Cluster Tutorial
MongoDB - Sharded Cluster TutorialMongoDB - Sharded Cluster Tutorial
MongoDB - Sharded Cluster TutorialJason Terpko
 
MongoDB – Sharded cluster tutorial - Percona Europe 2017
MongoDB – Sharded cluster tutorial - Percona Europe 2017MongoDB – Sharded cluster tutorial - Percona Europe 2017
MongoDB – Sharded cluster tutorial - Percona Europe 2017Antonios Giannopoulos
 
Docker 1.11 @ Docker SF Meetup
Docker 1.11 @ Docker SF MeetupDocker 1.11 @ Docker SF Meetup
Docker 1.11 @ Docker SF MeetupDocker, Inc.
 
Docker 1.11 Meetup: Containerd and runc, by Arnaud Porterie and Michael Crosby
Docker 1.11 Meetup: Containerd and runc, by Arnaud Porterie and Michael Crosby Docker 1.11 Meetup: Containerd and runc, by Arnaud Porterie and Michael Crosby
Docker 1.11 Meetup: Containerd and runc, by Arnaud Porterie and Michael Crosby Michelle Antebi
 
Docker 1.11 Meetup: Containerd and runc, by Arnaud Porterie and Michael Crosby
Docker 1.11 Meetup: Containerd and runc, by Arnaud Porterie and Michael CrosbyDocker 1.11 Meetup: Containerd and runc, by Arnaud Porterie and Michael Crosby
Docker 1.11 Meetup: Containerd and runc, by Arnaud Porterie and Michael CrosbyDocker, Inc.
 
Deployment of WebObjects applications on CentOS Linux
Deployment of WebObjects applications on CentOS LinuxDeployment of WebObjects applications on CentOS Linux
Deployment of WebObjects applications on CentOS LinuxWO Community
 
Crikeycon 2019 Velociraptor Workshop
Crikeycon 2019 Velociraptor WorkshopCrikeycon 2019 Velociraptor Workshop
Crikeycon 2019 Velociraptor WorkshopVelocidex Enterprises
 

Similar a Web scraping with nutch solr (20)

Apache1.ppt
Apache1.pptApache1.ppt
Apache1.ppt
 
Nginx as a Revers Proxy for Apache on Ubuntu
Nginx as a Revers Proxy for Apache on UbuntuNginx as a Revers Proxy for Apache on Ubuntu
Nginx as a Revers Proxy for Apache on Ubuntu
 
Penetration Testing Boot CAMP
Penetration Testing Boot CAMPPenetration Testing Boot CAMP
Penetration Testing Boot CAMP
 
OpenNebula 5.4 Hands-on Tutorial
OpenNebula 5.4 Hands-on TutorialOpenNebula 5.4 Hands-on Tutorial
OpenNebula 5.4 Hands-on Tutorial
 
les01.pdf
les01.pdfles01.pdf
les01.pdf
 
Android for Embedded Linux Developers
Android for Embedded Linux DevelopersAndroid for Embedded Linux Developers
Android for Embedded Linux Developers
 
Apache
ApacheApache
Apache
 
Setting up LAMP for Linux newbies
Setting up LAMP for Linux newbiesSetting up LAMP for Linux newbies
Setting up LAMP for Linux newbies
 
Apache Performance Tuning: Scaling Up
Apache Performance Tuning: Scaling UpApache Performance Tuning: Scaling Up
Apache Performance Tuning: Scaling Up
 
Cobbler, Func and Puppet: Tools for Large Scale Environments
Cobbler, Func and Puppet: Tools for Large Scale EnvironmentsCobbler, Func and Puppet: Tools for Large Scale Environments
Cobbler, Func and Puppet: Tools for Large Scale Environments
 
Cobbler, Func and Puppet: Tools for Large Scale Environments
Cobbler, Func and Puppet: Tools for Large Scale EnvironmentsCobbler, Func and Puppet: Tools for Large Scale Environments
Cobbler, Func and Puppet: Tools for Large Scale Environments
 
Sharded cluster tutorial
Sharded cluster tutorialSharded cluster tutorial
Sharded cluster tutorial
 
MongoDB - Sharded Cluster Tutorial
MongoDB - Sharded Cluster TutorialMongoDB - Sharded Cluster Tutorial
MongoDB - Sharded Cluster Tutorial
 
MongoDB – Sharded cluster tutorial - Percona Europe 2017
MongoDB – Sharded cluster tutorial - Percona Europe 2017MongoDB – Sharded cluster tutorial - Percona Europe 2017
MongoDB – Sharded cluster tutorial - Percona Europe 2017
 
Docker 1.11 @ Docker SF Meetup
Docker 1.11 @ Docker SF MeetupDocker 1.11 @ Docker SF Meetup
Docker 1.11 @ Docker SF Meetup
 
Docker 1.11 Meetup: Containerd and runc, by Arnaud Porterie and Michael Crosby
Docker 1.11 Meetup: Containerd and runc, by Arnaud Porterie and Michael Crosby Docker 1.11 Meetup: Containerd and runc, by Arnaud Porterie and Michael Crosby
Docker 1.11 Meetup: Containerd and runc, by Arnaud Porterie and Michael Crosby
 
Docker 1.11 Meetup: Containerd and runc, by Arnaud Porterie and Michael Crosby
Docker 1.11 Meetup: Containerd and runc, by Arnaud Porterie and Michael CrosbyDocker 1.11 Meetup: Containerd and runc, by Arnaud Porterie and Michael Crosby
Docker 1.11 Meetup: Containerd and runc, by Arnaud Porterie and Michael Crosby
 
Deployment of WebObjects applications on CentOS Linux
Deployment of WebObjects applications on CentOS LinuxDeployment of WebObjects applications on CentOS Linux
Deployment of WebObjects applications on CentOS Linux
 
Crikeycon 2019 Velociraptor Workshop
Crikeycon 2019 Velociraptor WorkshopCrikeycon 2019 Velociraptor Workshop
Crikeycon 2019 Velociraptor Workshop
 
Virt monitoring
Virt monitoringVirt monitoring
Virt monitoring
 

Más de Mike Frampton (20)

Apache Airavata
Apache AiravataApache Airavata
Apache Airavata
 
Apache MADlib AI/ML
Apache MADlib AI/MLApache MADlib AI/ML
Apache MADlib AI/ML
 
Apache MXNet AI
Apache MXNet AIApache MXNet AI
Apache MXNet AI
 
Apache Gobblin
Apache GobblinApache Gobblin
Apache Gobblin
 
Apache Singa AI
Apache Singa AIApache Singa AI
Apache Singa AI
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
 
OrientDB
OrientDBOrientDB
OrientDB
 
Prometheus
PrometheusPrometheus
Prometheus
 
Apache Tephra
Apache TephraApache Tephra
Apache Tephra
 
Apache Kudu
Apache KuduApache Kudu
Apache Kudu
 
Apache Bahir
Apache BahirApache Bahir
Apache Bahir
 
Apache Arrow
Apache ArrowApache Arrow
Apache Arrow
 
JanusGraph DB
JanusGraph DBJanusGraph DB
JanusGraph DB
 
Apache Ignite
Apache IgniteApache Ignite
Apache Ignite
 
Apache Samza
Apache SamzaApache Samza
Apache Samza
 
Apache Flink
Apache FlinkApache Flink
Apache Flink
 
Apache Edgent
Apache EdgentApache Edgent
Apache Edgent
 
Apache CouchDB
Apache CouchDBApache CouchDB
Apache CouchDB
 
An introduction to Apache Mesos
An introduction to Apache MesosAn introduction to Apache Mesos
An introduction to Apache Mesos
 
An introduction to Pentaho
An introduction to PentahoAn introduction to Pentaho
An introduction to Pentaho
 

Último

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Último (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

Web scraping with nutch solr

  • 1. Web Scraping Using Nutch and Solr ● A simple example of using open source code ● Web Scrape a single web site - ours ● Environment and code – Using Centos V6.2 ( Linux ) – Apache Nutch 1.6 – Solr 4.2.1 – Java 1.6
  • 2. Nutch and Solr Architecture ● Nutch processes urls and feeds content to Solr ● Solr indexes content
  • 3. Where to get source code ● Nutch – http://nutch.apache.org ● Solr – http://lucene.apache.org/solr ● Java – http://java.com
  • 4. Installing Source - Nutch ● Nutch is delivered as – apache-nutch-1.6-bin.tar ( 64M ) – apache-nutch-1.6-src.tar ( 20M ) ● Copy each tar file to your desired location ● Install each tar file as – tar xvf <tar file> ● Second tar file optional
  • 5. Installing Source - Solr ● Solr is delivered as – solr-4.2.1.zip ( 116M ) ● Copy file to your desired location ● Install each tar file as – unzip <zip file>
  • 6. Configuring Nutch Part 1 ● Assuming we will crawl a single web site ● Ensure that JAVA_HOME is set ● cd apache-nutch-1.6 ● Edit agent name in conf/nutch-site.xml <property> <name>http.agent.name</name> <value>Nutch Spider</value> </property> ● mkdir -p urls ; cd urls ; touch seed.txt
  • 7. Configuring Nutch Part 2 ● Add following url ( ours ) to seed.txt – http://www.semtech-solutions.co.nz ● Change url filtering in conf/regex-urlfilter.txt, change the line – # accept anything else – +. – To be – +^http://([a-z0-9]*.)*semtech-solutions.co.nz/ ● This means that we will filter the urls found to only be from the local site
  • 8. Configuring Solr Part 1 ● cd solr-4.2.1/example/solr/collection1/conf ● Add some extra fields to schema.xml after _version_ field i.e.
  • 9. Start Solr Server – Part 1 ● Within solr-4.2.1/example ● Run the following command ● java -jar start.jar ● Now try to access admin web page for solr – http://localhost:8983/solr/admin ● You should now see the admin web site – ( see next page )
  • 10. Start Solr Server – Part 2 ● Solr Admin web page
  • 11. Run Nutch / Solr ● We are ready to crawl our first web site ● Go to apache-nutch-1.6 directory ● Run the following commands – touch nutch_start.bash – chmod 755 nutch_start.bash – vi nutch_start.bash ● Add the text to the file #!/bin/bash bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir crawl -depth 3 -topN 3
  • 12. Run Nutch / Solr ● Now run the nutch bash file – ./nutch_start.bash ● Select the Logging option on the admin console ● Monitor for errors in Logging console ● The crawl should finish with no errors and the line – Crawl finished: crawl – In the crawl window
  • 13. Check Crawled Data ● Now we check the data that we have crawled ● In Admin Console window – Set Core Selector to collection1 – Select the Query option – Click execute query button ● You should now see some of the data that you have crawled
  • 14. Crawled Data ● Crawled data in solr query
  • 15. Crawled Data ● Thats your first simple crawl completed ● Further reading at – http://nutch.apache.org – http://lucene.apache.org/solr ● Now you can – Add more urls to your seed.txt – Increase the depth of your link search via options ● -depth ● -topN – Modify your url filtering
  • 16. Contact Us ● Feel free to contact us at – www.semtech-solutions.co.nz – info@semtech-solutions.co.nz ● We offer IT project consultancy ● We are happy to hear about your problems ● You can just pay for those hours that you need ● To solve your problems