Web mining tools grouped by content mining, usage mining, and structure mining. Tools: Tableau, R, Octoparse, and Scrapy. The HITS and PageRank algorithms are also covered.
2. Web Mining
Web mining is the use of data mining techniques
to automatically discover and extract information
from Web documents and services.
3 Types:
1. Web usage mining
2. Web content mining
3. Web structure mining
3. Web usage mining
Web usage mining is the process of discovering patterns in large sets of usage
data, such as web server logs. These patterns help you predict user behavior.
Tools :
1. Tableau
2. R
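A minimal sketch of the idea in Python, assuming usage data has already been grouped into per-user sessions. The `sessions` list, the page paths, and the helper names `count_transitions` and `predict_next` are hypothetical and only illustrate how a discovered pattern can be used to predict the next page a user visits.

```python
from collections import Counter, defaultdict

# Hypothetical sessions: the ordered pages each user visited.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/reviews"],
    ["/home", "/products", "/cart", "/checkout"],
    ["/home", "/about"],
]


def count_transitions(sessions):
    """Count how often each page is followed directly by another page."""
    transitions = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            transitions[current][following] += 1
    return transitions


def predict_next(transitions, page):
    """Return the most frequent follower of `page`, or None if unseen."""
    followers = transitions.get(page)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


transitions = count_transitions(sessions)
print(predict_next(transitions, "/products"))
```

Real usage mining would start from raw logs and use richer models; counting page-to-page transitions is only the simplest pattern to show.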
4. Tableau
➔Tableau offers a family of interactive data
visualization products focused on business
intelligence
➔It transforms data into visualizations
➔This process takes only seconds or minutes
with the help of a drag-and-drop interface
Official Website : http://www.tableau.com/
5. R
➔R is a free programming language and
software environment for statistical computing
and graphics.
➔The R language is widely used among data miners
for developing statistical software and for data
analysis
➔Ease of use and extensibility have raised R’s
popularity substantially in recent years
6. Web content mining
Web content mining is a process of collecting useful data from websites.
This content includes news, comments, company information, product catalogs,
etc.
Tools :
1. Octoparse
2. Scrapy
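As an illustration of collecting structured data from a page, here is a minimal sketch that uses only Python's standard `html.parser` module. The HTML snippet, the CSS class names (`product`, `name`, `price`), and the `ProductParser` class are assumptions made for this example. It is not how Octoparse or Scrapy work internally.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collect name/price pairs from <li class="product"> items."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.products.append({})  # start a new product record
        elif tag == "span" and css_class in ("name", "price") and self.products:
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


# Hypothetical product catalog markup.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Laptop</span> <span class="price">$999</span></li>
  <li class="product"><span class="name">Mouse</span> <span class="price">$25</span></li>
</ul>
"""

parser = ProductParser()
parser.feed(SAMPLE_HTML)
products = parser.products
print(products)
```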
7. Octoparse
➔Octoparse is a simple but powerful web data mining tool
that automates web data extraction.
➔It allows you to create highly accurate extraction rules
➔An extraction rule tells Octoparse:
➢which website to open
➢where the data you plan to crawl is located
➢what kind of data you want, etc.
Official Website : http://www.octoparse.com/
8. Scrapy
➔Scrapy is an open-source framework for collecting data
from websites.
➔It is written in Python, and you write
the rules that extract web data in Python as well.
➔Supported operating systems:
Linux, Windows, macOS and BSD
Official Website : https://scrapy.org/
9. Web structure mining
Web structure mining is also known as link mining.
It is the process of discovering relationships between web pages that are
connected by shared information or by direct links.
Tools :
1. HITS algorithm
2. PageRank Algorithm
10. Hyperlink-Induced Topic Search(HITS) algorithm
➔Also known as hubs and authorities, HITS is a link analysis algorithm that rates
Web pages
➔Uses a root set (the most relevant pages returned by a text-based search algorithm)
➔Generates a base set = root set + web pages that are linked from it and pages
that link to it
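The hub and authority scores can be sketched in Python as follows. This is a simplified illustration, not a reference implementation: it assumes the base set has already been built and is given as a dictionary mapping each page to the pages it links to. The graph, the page names, and the fixed iteration count are all hypothetical.

```python
import math


def hits(graph, iterations=50):
    """Return (hubs, authorities) for a dict mapping page -> linked pages."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        new_auths = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for target in targets:
                new_auths[target] += hubs[source]
        norm = math.sqrt(sum(v * v for v in new_auths.values())) or 1.0
        auths = {p: v / norm for p, v in new_auths.items()}

        # Hub score: sum of the authority scores of the pages it links to.
        new_hubs = {p: sum(auths[t] for t in graph.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in new_hubs.values())) or 1.0
        hubs = {p: v / norm for p, v in new_hubs.items()}

    return hubs, auths


# Hypothetical base set: A and B act as hubs, C and D as authorities.
base_set = {"A": ["C", "D"], "B": ["C"], "C": [], "D": []}
hubs, auths = hits(base_set)
print(max(hubs, key=hubs.get), max(auths, key=auths.get))
```

A good hub points to many good authorities, and a good authority is pointed to by many good hubs, which is why the two scores are updated in turn.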
11. PageRank Algorithm
➔PageRank is an algorithm used by Google Search
to rank web pages in its search engine results.
➔PageRank was named after Larry Page (one of
the founders of Google)
➔It assigns a numerical weighting to each element of
a hyperlinked set of documents with the purpose
of "measuring" its relative importance within the set
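A minimal sketch of how such a weighting can be computed by repeated iteration, written in Python. The damping factor of 0.85, the iteration count, the handling of pages without outgoing links, and the three-page example graph are assumptions for illustration. They are not taken from the slides and do not describe Google's production system.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Return a rank per page for a dict mapping page -> linked pages."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page receives a small base share, independent of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = graph.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank

    return rank


# Hypothetical link graph of three pages.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this example page C ends up with the highest weighting because both A and B link to it.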
12. References
★ 7 Web Mining Tools Around the Web
http://www.octoparse.com/blog/7-web-mining-tools-around-the-web/
★ Web mining Information : Wiki
https://en.wikipedia.org/wiki/Web_mining
★ HITS and PageRank Algorithm pdf