A. Insight into the FLIS Service
To understand the service clearly, it is important to examine its components and the way it works. As shown in Figure 1, there are two main components in the service: a channel provider and an aggregator. The channel provider works as a live-signal receiver and broadcaster, while the aggregator is a page containing links to live signals. When a user clicks on any link on the aggregator page, the user is redirected to a video player covered by ads. Afterwards, the money, which comes from Cost-Per-Thousand (CPT), Click-Per-Rate (CPR), Cost-Per-Click (CPC) or from clicks on fake close buttons, flows to the ad network, the aggregator, and the channel provider respectively.
III Datasets
There are two separate datasets, since the web pages were collected through www.google.com using English and Thai keywords respectively. The collection was done manually so that the number of false positives would be as low as possible. Each dataset contained 100 URLs, and each URL became a seed page.
IV Crawler
The crawler was written in Java using Eclipse and connected to MySQL through XAMPP. For each website, it stored the links, iframes and images; the HTTP Security Headers, namely X-XSS-Protection, X-Frame-Options, Content-Security-Policy, X-Content-Type-Options, Strict-Transport-Security and Public-Key-Pins; HTTPS usage; the website's location (IP address, hostname, city, region, country, coordinates, organization name and postal code); metadata; and the vulnerability status of each website.
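As a rough sketch of how one record might be stored, the snippet below inserts a page's counts over JDBC. The database, table and column names are hypothetical, and XAMPP's default MySQL credentials (user root, empty password) are assumed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CrawlerStore {
    public static void main(String[] args) throws Exception {
        // XAMPP ships MySQL with user "root" and an empty password by default;
        // the database and table names here are hypothetical.
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/flis_english", "root", "");
             PreparedStatement ps = db.prepareStatement(
                     "INSERT INTO page (url, links, iframes, images) VALUES (?, ?, ?, ?)")) {
            ps.setString(1, "http://www.example.com"); // placeholder URL
            ps.setInt(2, 120);  // counts as collected by the crawler
            ps.setInt(3, 4);
            ps.setInt(4, 35);
            ps.executeUpdate();
        }
    }
}
```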
V Data Aggregation
To collect data about links, iframes, and images, Selenium was used as a tool. The purpose of aggregating these data was to understand the characteristics of FLIS websites.
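As a rough illustration, the following Java sketch counts those three features on a single page. The browser driver and the URL are placeholders, not the actual configuration used in the research.

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class PageFeatureCounter {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();  // browser choice is an assumption
        driver.get("http://www.example.com");   // placeholder URL

        // Count the three page features the crawler records.
        int links   = driver.findElements(By.tagName("a")).size();
        int iframes = driver.findElements(By.tagName("iframe")).size();
        int images  = driver.findElements(By.tagName("img")).size();

        System.out.printf("links=%d iframes=%d images=%d%n", links, iframes, images);
        driver.quit();
    }
}
```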
Figure 1: Components of the FLIS service, from the paper "It's Free for a Reason: Exploring the Ecosystem of Free Live Streaming Services"

For the HTTP Security Headers, thanks to the OWASP Secure Headers Project, the data could be retrieved through the terminal with the command `curl -L -A "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36" -s -D - https://www.example.com -o /dev/null` [3].
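The same six headers can also be collected directly from Java instead of curl. The sketch below is a minimal alternative under that assumption, using a placeholder URL; it sends a browser-like User-Agent, as the `-A` flag does, and prints whichever of the recorded headers the server returns.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class SecurityHeaderCheck {
    // The six HTTP Security Headers recorded by the crawler.
    private static final String[] HEADERS = {
        "X-XSS-Protection", "X-Frame-Options", "Content-Security-Policy",
        "X-Content-Type-Options", "Strict-Transport-Security", "Public-Key-Pins"
    };

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.example.com"); // placeholder URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Mimic a real browser, as the curl -A flag does above.
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");

        for (String h : HEADERS) {
            String value = conn.getHeaderField(h); // header name lookup ignores case
            System.out.println(h + ": " + (value == null ? "(not set)" : value));
        }
        conn.disconnect();
    }
}
```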
Likewise, the website's location data could be obtained through the terminal with the command `curl ipinfo.io/(website's IP address)` [4].
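The same lookup can be issued from the crawler itself. The sketch below fetches ipinfo.io's JSON answer for one IP address; the address is a placeholder, and the fields named in the comment follow the service's documented response.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class LocationLookup {
    public static void main(String[] args) throws Exception {
        String ip = "93.184.216.34"; // placeholder IP of a crawled host
        URL url = new URL("https://ipinfo.io/" + ip + "/json");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()))) {
            // ipinfo.io returns JSON with ip, hostname, city, region, country,
            // loc (coordinates), org and postal fields.
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```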
Additionally, HTTPS usage was determined by checking the connection port along with any redirection of the webpage. Since the URLs in the datasets contained only "http://", any page using HTTPS would have to redirect.
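A minimal Java version of this test might look as follows. It disables automatic redirect handling so that the Location header can be inspected; the URL is a placeholder.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpsRedirectCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.example.com"); // dataset URLs are all http://
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(false); // inspect the redirect ourselves
        int code = conn.getResponseCode();
        String location = conn.getHeaderField("Location");

        // A 3xx code with an https:// Location means the site upgrades to HTTPS.
        boolean usesHttps = code >= 300 && code < 400
                && location != null && location.startsWith("https://");
        System.out.println("response code: " + code + ", HTTPS redirect: " + usesHttps);
    }
}
```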
Furthermore, for automatic dataset improvement, the metadata in the description and keywords tags of each website was recorded via Selenium WebDriver. Each sentence of the obtained data was split differently depending on the language structure. For greater precision, regular expressions were applied together with manual analysis. As a result, every collected word was counted and given a score according to its frequency.
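As an illustration of the counting step, the sketch below tokenizes one hypothetical keywords string and scores each word by its frequency. Real Thai text would need a language-specific tokenizer rather than this whitespace split.

```java
import java.util.HashMap;
import java.util.Map;

public class KeywordScorer {
    public static void main(String[] args) {
        // Placeholder content of a keywords meta tag from one seed page.
        String keywords = "watch free movies, free live streaming, watch movies online";

        // Split on commas and whitespace; Thai would need its own tokenizer
        // (see Section VII.A), since it does not separate words with spaces.
        Map<String, Integer> freq = new HashMap<>();
        for (String word : keywords.toLowerCase().split("[,\\s]+")) {
            if (!word.isEmpty()) {
                freq.merge(word, 1, Integer::sum);
            }
        }

        // Frequency is used directly as the word's score.
        freq.forEach((w, n) -> System.out.println(w + " -> " + n));
    }
}
```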
Besides, the vulnerability of each website was checked individually through the Google Transparency Report and recorded.
Finally, all data was stored in two separate databases, one per dataset. Each database was divided into several tables in order to reduce data redundancy.
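One plausible way to lay out such tables is sketched below; every table and column name is hypothetical, chosen only to show the one-table-per-category split that avoids repeating a URL's facts across rows.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SchemaSetup {
    public static void main(String[] args) throws Exception {
        // XAMPP's default MySQL credentials; schema names are hypothetical.
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/flis_english", "root", "");
             Statement st = db.createStatement()) {

            // One table per data category reduces redundancy.
            st.executeUpdate("CREATE TABLE IF NOT EXISTS page "
                    + "(url VARCHAR(255) PRIMARY KEY, links INT, iframes INT, images INT)");
            st.executeUpdate("CREATE TABLE IF NOT EXISTS security_header "
                    + "(url VARCHAR(255), header VARCHAR(64), setting VARCHAR(255))");
            st.executeUpdate("CREATE TABLE IF NOT EXISTS location "
                    + "(url VARCHAR(255), ip VARCHAR(45), city VARCHAR(64), "
                    + "country CHAR(2), org VARCHAR(128))");
        }
    }
}
```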
VI Selenium, a crawling tool
As stated earlier, Selenium was used as the tool for data accumulation. Selenium is a set of different software tools, each with a different approach to supporting test automation [5]. There are four main variants of Selenium: IDE, Remote Control (RC), WebDriver and Selenium Grid. Selenium WebDriver was selected for several reasons. It is flexible, since it can be used with Java, unlike the IDE. Selenium Remote Control is deprecated and has been replaced by WebDriver. Finally, Selenium Grid has many functions that were not necessary for this research.
VII Dataset Improvement
A. Metadata Aggregation
The aggregated metadata mentioned previously was first obtained in string form and later separated into words by specific separators depending on the language. For example, the Thai dataset uses
Table 3 and Table 4 show the five most frequent words extracted from the keyword meta tags of the seed pages, together with their frequencies and calculated scores, for the English and Thai datasets respectively.
C. Crawling Process
Next, the two highest-scoring keywords were used for searching on Google. This process was done automatically through Selenium. First, the keywords were entered into Google. Then, all the URLs from every page of Google's results were collected and recorded in the database. Lastly, the meta descriptions and keywords extracted from each URL were stored in the database as before.
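A hedged sketch of that collection step is shown below. The query string, the browser driver and the CSS selector for result links are all assumptions, since Google's result markup changes over time.

```java
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class GoogleResultCollector {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver(); // browser choice is an assumption
        // The two highest-scoring keywords joined into one query (placeholder).
        driver.get("https://www.google.com/search?q=free+live+streaming");

        // Collect outbound links from the results container; the selector is
        // an assumption about Google's markup.
        List<WebElement> anchors = driver.findElements(By.cssSelector("#search a"));
        for (WebElement a : anchors) {
            String href = a.getAttribute("href");
            if (href != null && href.startsWith("http") && !href.contains("google.")) {
                System.out.println(href); // recorded into the database in the real crawler
            }
        }
        driver.quit();
    }
}
```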
D. Score Assignment
To calculate the score, the scores of the metadata from the seed pages were used. Each word from a newly aggregated page that matched the collected metadata in the database was assigned the corresponding score. Finally, the keyword scores were summed into a final score for every crawled URL. All of these steps were performed automatically through MySQL Workbench.
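One way to express that summation in SQL, run here through JDBC, is sketched below; the crawled_word and seed_score tables are hypothetical stand-ins for the actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FinalScore {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/flis_english", "root", "")) {
            // Hypothetical tables: crawled_word(url, word) holds words extracted
            // from each crawled page; seed_score(word, score) holds the
            // per-word scores computed from the seed pages.
            String sql = "SELECT cw.url, SUM(ss.score) AS final_score "
                    + "FROM crawled_word cw JOIN seed_score ss ON cw.word = ss.word "
                    + "GROUP BY cw.url ORDER BY final_score DESC";
            try (PreparedStatement ps = db.prepareStatement(sql);
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("url") + " -> " + rs.getInt("final_score"));
                }
            }
        }
    }
}
```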
Table 3: Calculated scores of words extracted from the keyword meta tags of the English dataset
Table 4: Calculated scores of words extracted from the keyword meta tags of the Thai dataset
B. HTTP Security Header
One popular way to improve website security is to use HTTP Security Headers [8].
Table 8 shows the percentage of HTTP Security Header usage in the English and Thai datasets. From our observations, the HTTP Security Headers that FLIS websites most commonly implement are X-Frame-Options, X-XSS-Protection and X-Content-Type-Options, in that order. Moreover, the results show that each website always uses the same configuration [9], as shown in the 'Setting' column of Table 8.
From the gathered data, it can be concluded that the Thai dataset has a higher rate of HTTP Security Header implementation than the English one.
Figure 2: An example FLIS webpage from the English dataset, taken from the top of Google.co.jp's results: http://vumoo.at.
Figure 3: An example FLIS webpage from the Thai dataset, which contains overlay ads, malicious popups and advertisements: https://www.nungmovies-hd.com.
C. HTTPS Usage
The Hyper Text Transfer Protocol Secure (HTTPS) [11] usage test was conducted by checking ports and redirections. Figure 4 shows the response codes [12] from connecting to each URL in the English and Thai datasets. The x-axis represents the port number responding to each URL connection, while the y-axis shows the frequency of each port in percentage. The results show that in both datasets there is neither any connection to HTTPS on port 443 [13] nor any redirection of the webpage, which clearly means that none of the FLIS websites used HTTPS.
D. Location
By investigating the IP address of every URL in the datasets, the location of each website could be retrieved. Figure 5 and Figure 6 show the results. In both figures, the country code of each link is indicated on the x-axis and the frequency in percentage on the y-axis. The graphs show that the majority of FLIS websites are located in the US (United States of America), which is probably caused by the location of the organization each website depends on, discussed in the next section.
Figure 4: The response codes from the redirection test on the English and Thai datasets
Table 8: HTTP Security Header settings and implementation percentages for the English and Thai datasets
E. Organization
Table 9 and Table 10 list the organizations that the URLs in the datasets depend on, with their frequencies in percentages. According to the results, CloudFlare, Inc. has the highest frequency in both datasets.
Figure 5: The country code where each URL is located and its frequency in percentage, for the English dataset
Figure 6: The country code where each URL is located and its frequency in percentage, for the Thai dataset
CloudFlare [14] is one of the most popular organizations that websites depend on, because it can speed up the relying websites and also provides some security support.
As mentioned in the previous section, the majority of websites are located in the US due to the location of CloudFlare, Inc. Therefore, using Domaintools [15], an attempt was made to discover the location of each domain registrant; however, only some of them could be disclosed. In Figure 7 and Figure 8, the x-axis indicates the country code of each URL behind CloudFlare, Inc., while the y-axis shows the frequency in percentages.
Although the US still has the highest frequency, some countries such as PA (Panama) and AU (Australia) were also popular locations hidden behind the organization.
Table 9: Organizations that each URL in the English dataset relies on, with frequencies in percentages
Table 10: Organizations that each URL in the Thai dataset relies on, with frequencies in percentages
IX Conclusion
FLIS websites contain high numbers of iframes and images. Regarding the HTTP Security Headers, the most commonly used are X-Frame-Options, X-XSS-Protection and X-Content-Type-Options, in that order, and the Thai dataset has a higher rate of HTTP Security Header implementation than the English one. However, none of the websites implemented HTTPS. Many of the websites from both datasets were located in the US under CloudFlare, Inc.
Considering all the investigations, FLIS websites probably do not aim to attack users, as people might assume. However, further research could examine this kind of service more closely with regard to overlay ads, popups and advertisements, which could bring about malicious issues.
In a nutshell, FLIS websites do not differ much in location, and the majority of the content in FLIS services is not malicious.
X Problems
Nonetheless, a number of problems were encountered. Regarding the accumulation of web pages, searching for 100 non-redundant seed pages on Google was not an easy task, since Google generates its results to maximize recall and precision. In addition, different websites use different coding styles, which sometimes caused bugs in the program. For example, a number of websites used quotation marks inside crawled attributes, which broke the MySQL statements. Moreover, neither Eclipse nor MySQL supports the Thai language by default, so extra configuration was needed. Besides, time and the Internet connection became essential factors in collecting all the data, because some websites took a long time to download due to their large numbers of images. Additionally, some crawled websites unexpectedly stopped working and became inaccessible after aggregation, which might affect the crawling results. Furthermore, the results from the two datasets cannot be compared precisely, since the numbers of URLs obtained from the automated crawls of the two datasets are not equal.
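For the quotation-mark and Thai-encoding problems in particular, the standard JDBC remedy is a parameterized statement over a UTF-8 connection; the sketch below assumes a hypothetical metadata table.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class SafeInsert {
    public static void main(String[] args) throws Exception {
        // An attribute value containing quotation marks, which breaks
        // naively concatenated SQL strings.
        String description = "Watch \"free\" movies online";

        // characterEncoding=utf8 lets the driver carry Thai text correctly;
        // the database and table names are hypothetical.
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/flis_thai?useUnicode=true&characterEncoding=utf8",
                "root", "");
             PreparedStatement ps = db.prepareStatement(
                     "INSERT INTO metadata (url, description) VALUES (?, ?)")) {
            ps.setString(1, "http://www.example.com"); // placeholder URL
            ps.setString(2, description); // quotes are escaped by the driver
            ps.executeUpdate();
        }
    }
}
```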
XI Tradeoffs
The automatic crawler depends on the capability of Google, since it uses the search engine as its data accumulation tool. Besides, because it relies on the metadata of the original datasets, any improvement of the crawler always depends on the existing data.
XII Acknowledgements
The author would like to thank Assistant Professor Doudou Fall for his kind advice and guidance throughout the research.