1. Template for writing functional specifications in an SRS document
1. Collect and extract web data
This method extracts four types of data from any website by traversing
the DOM structure of its pages.
Figure 1 - Collect and extract data
…….[Write a short description of Figure 1]
The following figure shows the data extraction flow:
Figure 2 - Extracting data flow
…….[Write a short description of Figure 2]
The source-web model:
Figure 3 - Source web model
…….[Write a short description of Figure 3]
Model of storing data:
Figure 4 - Model of storing data
…….[Write a short description of Figure 4]
2. Web data mining function specification
a. getSourceWebFromURL() function
ID           GS-01
Input        String: URL - address of a website
Output       String: HTML - resource retrieved via HTTP
Description  Uses the open-source HttpClient library to create an HttpClient
             object, which retrieves the data at the input URL in text/html
             format.
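The GS-01 contract (URL string in, HTML string out) can be sketched as follows. This is a minimal illustration, not the implementation the specification refers to: it swaps in the JDK's built-in java.net.http.HttpClient (Java 11+) for the open-source HttpClient library, and the class name SourceWebFetcher, the timeouts, and the status check are assumptions.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper class; the name is not taken from the specification.
class SourceWebFetcher {
    private final HttpClient client = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .connectTimeout(Duration.ofSeconds(10)) // assumed timeout
            .build();

    // Builds a GET request that asks the server for text/html.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "text/html")
                .timeout(Duration.ofSeconds(30)) // assumed timeout
                .GET()
                .build();
    }

    // GS-01: input is the address of a website, output is its HTML as a string.
    String getSourceWebFromURL(String url) throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) { // assumed policy: treat non-200 as failure
            throw new IOException("Unexpected HTTP status "
                    + response.statusCode() + " for " + url);
        }
        return response.body();
    }
}
```

Keeping request construction in its own method lets the headers be checked without network access; the returned string is what CI-01 takes as input.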
b. craw_index() function
ID           CI-01
Input        Result of GS-01
Output       Name of each category and its path
Description  From the result of GS-01, uses a Jsoup object to collect the
             content defined in the config file. The path of each category is
             then crawled in CC-01.
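A rough sketch of the CI-01 step is given below. The specification names Jsoup for this step; to keep the example runnable with the JDK alone, it substitutes the standard XML DOM parser and XPath, which only accept well-formed markup (real-world HTML usually needs a tolerant parser such as Jsoup). The class name IndexCrawler and the idea that the config file supplies one expression selecting the category links are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical class; stands in for the Jsoup-based crawler of the specification.
class IndexCrawler {
    // CI-01: takes the HTML from GS-01 and returns category name -> path.
    // categoryXPath is assumed to come from the config file and must select <a> elements.
    static Map<String, String> craw_index(String html, String categoryXPath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted pages cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
            NodeList links = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(categoryXPath, doc, XPathConstants.NODESET);
            Map<String, String> categories = new LinkedHashMap<>(); // keeps page order
            for (int i = 0; i < links.getLength(); i++) {
                Element a = (Element) links.item(i);
                categories.put(a.getTextContent().trim(), a.getAttribute("href"));
            }
            return categories;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not extract categories", e);
        }
    }
}
```

For a page whose menu is `<ul id='menu'>` with one link per category, a config expression such as `//ul[@id='menu']//a` would yield one entry per category, each of which is handed to CC-01.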
c. craw_category() function
ID           CC-01
Input        Result of CI-01
Output       Path to each news link
Description  Each category path from CI-01 leads to a page that contains links
             to news articles. The page is retrieved in HTML format and parsed
             with Jsoup. The content to collect is defined in the config file.
             The content behind each link is extracted in CT-01.
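One detail of CC-01 that is easy to overlook is that links found on a category page may be relative. The sketch below shows one possible way to turn them into absolute, de-duplicated paths before they are passed to CT-01. The class NewsLinkResolver, the method name, and the rule of skipping empty and fragment-only links are assumptions, not part of the specification.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for the output side of CC-01.
class NewsLinkResolver {
    // Resolves each href against the category page URL and drops duplicates,
    // keeping the order in which links appeared on the page.
    // Note: URI.create throws IllegalArgumentException for malformed hrefs (e.g. with spaces).
    static List<String> resolveNewsLinks(String categoryUrl, List<String> hrefs) {
        URI base = URI.create(categoryUrl);
        Set<String> unique = new LinkedHashSet<>();
        for (String href : hrefs) {
            if (href == null || href.isBlank() || href.trim().startsWith("#")) {
                continue; // assumed rule: ignore empty and in-page anchor links
            }
            unique.add(base.resolve(href.trim()).toString());
        }
        return new ArrayList<>(unique);
    }
}
```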
d. craw_content() function
ID           CT-01
Input        Result of CC-01
Output       Content comprising Title, Image Link, Description, and Content
             (text data)
Description  Each link in the group returned by CC-01 is fetched using GS-01.
             The content is in HTML format. The four required data types are
             extracted and stored in the database.
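The extraction part of CT-01 might look like the sketch below, which pulls the four fields out of one article page. As with any JDK-only example, it uses the standard XML parser and XPath instead of Jsoup and therefore assumes well-formed markup. The NewsContent type, the config keys ("title", "imageLink", "description", "content"), and the use of XPath expressions as config values are all hypothetical; storing the result in the database is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Hypothetical value type holding the four data types named in CT-01.
final class NewsContent {
    final String title;
    final String imageLink;
    final String description;
    final String content;

    NewsContent(String title, String imageLink, String description, String content) {
        this.title = title;
        this.imageLink = imageLink;
        this.description = description;
        this.content = content;
    }
}

// Hypothetical class; stands in for the Jsoup-based extractor of the specification.
class ContentCrawler {
    // CT-01: html is one article page fetched via GS-01; config maps each
    // field name to the expression that locates it (assumed config layout).
    static NewsContent craw_content(String html, Map<String, String> config) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted pages cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // evaluate(expression, item) returns the string value of the first match.
            return new NewsContent(
                    xpath.evaluate(config.get("title"), doc).trim(),
                    xpath.evaluate(config.get("imageLink"), doc).trim(),
                    xpath.evaluate(config.get("description"), doc).trim(),
                    xpath.evaluate(config.get("content"), doc).trim());
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not extract article content", e);
        }
    }
}
```

Chained together, the four functions form the pipeline described above: GS-01 fetches a page, CI-01 lists categories, CC-01 lists news links per category, and CT-01 turns each link into one record ready to be stored.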