Web scraping with Nutch and Solr


Part 1 of a three-part presentation showing how Nutch and Solr can be used to crawl the web, extract data, and prepare it for loading into a data warehouse.

Web Scraping Using Nutch and Solr
● A simple example of using open source code
● Web scrape a single web site: ours
● Environment and code
– CentOS 6.2 (Linux)
– Apache Nutch 1.6
– Solr 4.2.1
– Java 1.6

Nutch and Solr Architecture
● Nutch processes URLs and feeds content to Solr
● Solr indexes content

Where to get source code
● Nutch
– http://nutch.apache.org
● Solr
– http://lucene.apache.org/solr
● Java
– http://java.com

Installing Source - Nutch
● Nutch is delivered as
– apache-nutch-1.6-bin.tar ( 64M )
– apache-nutch-1.6-src.tar ( 20M )
● Copy each tar file to your desired location
● Extract each tar file with
– tar xvf <tar file>
● The second (source) tar file is optional

Installing Source - Solr
● Solr is delivered as
– solr-4.2.1.zip ( 116M )
● Copy the zip file to your desired location
● Extract the zip file with
– unzip <zip file>

Configuring Nutch Part 1
● Assuming we will crawl a single web site
● Ensure that JAVA_HOME is set
● cd apache-nutch-1.6
● Set the agent name in conf/nutch-site.xml
<property>
  <name>http.agent.name</name>
  <value>Nutch Spider</value>
</property>
● Create the seed list: mkdir -p urls ; cd urls ; touch seed.txt
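
The configuration steps above can be scripted. The following is a minimal sketch, not part of the original slides: setup_nutch_conf is a hypothetical helper name, and it overwrites conf/nutch-site.xml with only the agent name override, which assumes the file holds no other settings you want to keep. It runs against a scratch directory, so no Nutch install is touched.

```shell
#!/bin/bash
# setup_nutch_conf is a hypothetical helper; its argument is the Nutch
# install directory (for example apache-nutch-1.6).
setup_nutch_conf() {
  local nutch_home="$1"
  mkdir -p "$nutch_home/conf" "$nutch_home/urls"
  # Assumption: it is acceptable to overwrite nutch-site.xml so that it
  # contains only the http.agent.name override.
  cat > "$nutch_home/conf/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Nutch Spider</value>
  </property>
</configuration>
EOF
  # Create the empty seed list that will later hold the start URLs.
  touch "$nutch_home/urls/seed.txt"
}

# Demonstrate against a scratch directory rather than a real install.
demo_dir="$(mktemp -d)"
setup_nutch_conf "$demo_dir"
ls "$demo_dir/conf" "$demo_dir/urls"
```

To apply it to a real install, pass the path of the apache-nutch-1.6 directory instead of the scratch directory.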

Configuring Nutch Part 2
● Add the following URL (ours) to seed.txt
– http://www.semtech-solutions.co.nz
● Change the URL filtering in conf/regex-urlfilter.txt; change the lines
– # accept anything else
– +.
– to
– +^http://([a-z0-9]*\.)*semtech-solutions.co.nz/
● This restricts the crawl to URLs found on our own site
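
Before crawling, it can help to check which URLs the filter pattern would accept. The sketch below is an approximation only: it tests the pattern with grep -E rather than with Nutch's own Java regex engine, it drops the leading "+" (which only marks the rule as an accept rule), and it writes the subdomain dot escaped as \. so that it matches a literal dot. The url_allowed helper and the example.com test URL are hypothetical.

```shell
#!/bin/bash
# The accept pattern from regex-urlfilter.txt, without the leading "+".
pattern='^http://([a-z0-9]*\.)*semtech-solutions.co.nz/'

# url_allowed is a hypothetical helper: exits 0 when the URL matches.
url_allowed() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Try a few URLs and report whether the pattern would accept each one.
for url in \
  "http://www.semtech-solutions.co.nz/index.html" \
  "http://semtech-solutions.co.nz/" \
  "http://www.example.com/"
do
  if url_allowed "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```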

Configuring Solr Part 1
● cd solr-4.2.1/example/solr/collection1/conf
● Add some extra fields to schema.xml after _version_ field i.e.

Start Solr Server – Part 1
● Within solr-4.2.1/example
● Run the following command
– java -jar start.jar
● Now try to access the Solr admin web page
– http://localhost:8983/solr/admin
● You should now see the admin web site
– ( see next page )

Start Solr Server – Part 2
● Solr Admin web page

Run Nutch / Solr
● We are ready to crawl our first web site
● Go to apache-nutch-1.6 directory
● Run the following commands
– touch nutch_start.bash
– chmod 755 nutch_start.bash
– vi nutch_start.bash
● Add the following text to the file
#!/bin/bash
# Crawl the seed URLs and index the results into the local Solr server
bin/nutch crawl urls -solr http://localhost:8983/solr/ \
  -dir crawl -depth 3 -topN 3

Run Nutch / Solr
● Now run the nutch bash file
– ./nutch_start.bash
● Select the Logging option on the Solr admin console
● Monitor the Logging console for errors
● The crawl should finish with no errors, and the crawl window should show the line
– Crawl finished: crawl
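
The completion check can also be done from a saved log instead of by eye. This is a sketch under assumptions: crawl_finished is a hypothetical helper, crawl.log is an assumed file name, and the match is case-insensitive in case the exact capitalisation of the completion line differs between Nutch versions.

```shell
#!/bin/bash
# crawl_finished is a hypothetical helper: exits 0 if the saved crawl
# output contains the completion line (matched case-insensitively).
crawl_finished() {
  grep -qi 'crawl finished: crawl' "$1"
}

# Assumed usage: the crawl output was saved beforehand with
#   ./nutch_start.bash 2>&1 | tee crawl.log
log="${1:-crawl.log}"
if [ -f "$log" ] && crawl_finished "$log"; then
  echo "crawl completed"
else
  echo "crawl did not complete (or $log is missing)"
fi
```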

Check Crawled Data
● Now we check the data that we have crawled
● In the Admin Console window
– Set Core Selector to collection1
– Select the Query option
– Click the Execute Query button
● You should now see some of the data that you have crawled
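
The same check can be made from the command line through Solr's select handler. The sketch below only builds the query URL, so it runs without a Solr server. solr_query_url is a hypothetical helper, and the base URL and the collection1 core are assumed to match the local example setup used in these slides.

```shell
#!/bin/bash
# Assumed default: the local Solr started from the example directory.
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# solr_query_url is a hypothetical helper: prints a select URL that matches
# all documents (q=*:*) and asks for JSON output with a row limit.
solr_query_url() {
  local core="${1:-collection1}" rows="${2:-10}"
  echo "$SOLR_BASE/$core/select?q=*:*&wt=json&rows=$rows"
}

# Print the URL for the first five documents in collection1.
solr_query_url collection1 5
# With Solr running, fetch the results with:
#   curl -s "$(solr_query_url collection1 5)"
```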

Crawled Data
● Crawled data in Solr query

Crawled Data
● That's your first simple crawl completed
● Further reading at
– http://nutch.apache.org
– http://lucene.apache.org/solr
● Now you can
– Add more URLs to your seed.txt
– Widen your crawl via the options
● -depth ( link depth to follow from the seed pages )
● -topN ( maximum number of pages fetched at each level )
– Modify your URL filtering
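
To experiment with different -depth and -topN values, the crawl command can be generated rather than edited by hand. This is a sketch only: crawl_cmd and the SOLR_URL variable are hypothetical names, and the flags are the ones used in nutch_start.bash. It prints the command instead of running it, so it works without Nutch installed.

```shell
#!/bin/bash
# Assumed default: the local Solr URL used in nutch_start.bash.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"

# crawl_cmd is a hypothetical helper: prints the crawl command for a given
# depth (first argument) and topN (second argument).
crawl_cmd() {
  local depth="$1" topn="$2"
  echo "bin/nutch crawl urls -solr $SOLR_URL -dir crawl -depth $depth -topN $topn"
}

# Print the command for a deeper, wider crawl than the first example.
crawl_cmd 5 50
```

Run the printed command from the apache-nutch-1.6 directory once you are happy with the values.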

Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You pay only for the hours you need to solve your problems
