A. Insight into the FLIS Service
To understand the service clearly, it is important to examine its components and the way it works. As shown in Figure 1, there are two main components in the service: a channel provider and an aggregator. The channel provider works as a live-signal receiver and broadcaster, while the aggregator is a page containing links to live signals. When a user clicks on any link on the aggregator page, the user is redirected to a video player covered by ads. Afterwards, the money, which comes from Cost-Per-Thousand (CPT), Click-Per-Rate (CPR), Cost-Per-Click (CPC) or from clicks on fake close buttons, flows to the ad network, the aggregator, and the channel provider respectively.
III Datasets
There are two separate datasets, since the web pages were collected through www.google.com using English and Thai keywords respectively. The collection was done manually so that the number of false positives would be as low as possible. Each dataset contained 100 URLs, and each URL became a seed page.
IV Crawler
The crawler was written in Java using Eclipse and connected to MySQL through XAMPP. For each website, it stored the links, iframes and images; the HTTP Security Headers, namely X-XSS-Protection, X-Frame-Options, Content-Security-Policy, X-Content-Type-Options, Strict-Transport-Security and Public-Key-Pins; HTTPS usage; the website's location (IP address, hostname, city, region, country, coordinates, organization name and postal code); metadata; and the vulnerability status of each website.
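As a rough sketch of how one record might be stored, the snippet below inserts a page's counts over JDBC. The database, table and column names are hypothetical, and XAMPP's default MySQL credentials (user root, empty password) are assumed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CrawlerStore {
    public static void main(String[] args) throws Exception {
        // XAMPP ships MySQL with user "root" and an empty password by default;
        // the database and table names here are hypothetical.
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/flis_english", "root", "");
             PreparedStatement ps = db.prepareStatement(
                     "INSERT INTO page (url, links, iframes, images) VALUES (?, ?, ?, ?)")) {
            ps.setString(1, "http://www.example.com"); // placeholder URL
            ps.setInt(2, 120);  // counts as collected by the crawler
            ps.setInt(3, 4);
            ps.setInt(4, 35);
            ps.executeUpdate();
        }
    }
}
```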
V Data Aggregation
To collect data about links, iframes, and images, Selenium was used as a tool. The purpose of aggregating these data was to understand the characteristics of FLIS websites.
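As a rough illustration, the following Java sketch counts those three features on a single page. The browser driver and the URL are placeholders, not the actual configuration used in the research.

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class PageFeatureCounter {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();  // browser choice is an assumption
        driver.get("http://www.example.com");   // placeholder URL

        // Count the three page features the crawler records.
        int links   = driver.findElements(By.tagName("a")).size();
        int iframes = driver.findElements(By.tagName("iframe")).size();
        int images  = driver.findElements(By.tagName("img")).size();

        System.out.printf("links=%d iframes=%d images=%d%n", links, iframes, images);
        driver.quit();
    }
}
```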
Figure 1: Components of the FLIS service, from the paper "It's Free for a Reason: Exploring the Ecosystem of Free Live Streaming Services"

For the HTTP Security Headers, thanks to the OWASP Secure Headers Project, the data could be retrieved through the terminal with the command `curl -L -A "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36" -s -D - https://www.example.com -o /dev/null` [3].
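The same six headers can also be collected directly from Java instead of curl. The sketch below is a minimal alternative under that assumption, using a placeholder URL; it sends a browser-like User-Agent, as the `-A` flag does, and prints whichever of the recorded headers the server returns.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class SecurityHeaderCheck {
    // The six HTTP Security Headers recorded by the crawler.
    private static final String[] HEADERS = {
        "X-XSS-Protection", "X-Frame-Options", "Content-Security-Policy",
        "X-Content-Type-Options", "Strict-Transport-Security", "Public-Key-Pins"
    };

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.example.com"); // placeholder URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Mimic a real browser, as the curl -A flag does above.
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");

        for (String h : HEADERS) {
            String value = conn.getHeaderField(h); // header name lookup ignores case
            System.out.println(h + ": " + (value == null ? "(not set)" : value));
        }
        conn.disconnect();
    }
}
```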
Likewise, the website's location data could be obtained through the terminal with the command `curl ipinfo.io/(website's IP address)` [4].
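The same lookup can be issued from the crawler itself. The sketch below fetches ipinfo.io's JSON answer for one IP address; the address is a placeholder, and the fields named in the comment follow the service's documented response.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class LocationLookup {
    public static void main(String[] args) throws Exception {
        String ip = "93.184.216.34"; // placeholder IP of a crawled host
        URL url = new URL("https://ipinfo.io/" + ip + "/json");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()))) {
            // ipinfo.io returns JSON with ip, hostname, city, region, country,
            // loc (coordinates), org and postal fields.
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```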
Additionally, HTTPS usage was determined by checking the connection port along with any redirection of the webpage. Since the URLs in the datasets contained only "http://", any page using HTTPS would have to redirect.
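A minimal Java version of this test might look as follows. It disables automatic redirect handling so that the Location header can be inspected; the URL is a placeholder.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpsRedirectCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.example.com"); // dataset URLs are all http://
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(false); // inspect the redirect ourselves
        int code = conn.getResponseCode();
        String location = conn.getHeaderField("Location");

        // A 3xx code with an https:// Location means the site upgrades to HTTPS.
        boolean usesHttps = code >= 300 && code < 400
                && location != null && location.startsWith("https://");
        System.out.println("response code: " + code + ", HTTPS redirect: " + usesHttps);
    }
}
```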
Furthermore, for automatic dataset improvement, the metadata in the description and keywords tags of each website was recorded via Selenium WebDriver. Each sentence of the obtained data was split differently depending on the language structure. For greater precision, regular expressions were applied together with manual analysis. As a result, every collected word was counted and given a score according to its frequency.
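As an illustration of the counting step, the sketch below tokenizes one hypothetical keywords string and scores each word by its frequency. Real Thai text would need a language-specific tokenizer rather than this whitespace split.

```java
import java.util.HashMap;
import java.util.Map;

public class KeywordScorer {
    public static void main(String[] args) {
        // Placeholder content of a keywords meta tag from one seed page.
        String keywords = "watch free movies, free live streaming, watch movies online";

        // Split on commas and whitespace; Thai would need its own tokenizer
        // (see Section VII.A), since it does not separate words with spaces.
        Map<String, Integer> freq = new HashMap<>();
        for (String word : keywords.toLowerCase().split("[,\\s]+")) {
            if (!word.isEmpty()) {
                freq.merge(word, 1, Integer::sum);
            }
        }

        // Frequency is used directly as the word's score.
        freq.forEach((w, n) -> System.out.println(w + " -> " + n));
    }
}
```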
Besides, the vulnerability of each website was checked individually through the Google Transparency Report and recorded.
Finally, all data was stored in two separate databases, one per dataset. Each database was divided into several tables in order to reduce data redundancy.
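One plausible way to lay out such tables is sketched below; every table and column name is hypothetical, chosen only to show the one-table-per-category split that avoids repeating a URL's facts across rows.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SchemaSetup {
    public static void main(String[] args) throws Exception {
        // XAMPP's default MySQL credentials; schema names are hypothetical.
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/flis_english", "root", "");
             Statement st = db.createStatement()) {

            // One table per data category reduces redundancy.
            st.executeUpdate("CREATE TABLE IF NOT EXISTS page "
                    + "(url VARCHAR(255) PRIMARY KEY, links INT, iframes INT, images INT)");
            st.executeUpdate("CREATE TABLE IF NOT EXISTS security_header "
                    + "(url VARCHAR(255), header VARCHAR(64), setting VARCHAR(255))");
            st.executeUpdate("CREATE TABLE IF NOT EXISTS location "
                    + "(url VARCHAR(255), ip VARCHAR(45), city VARCHAR(64), "
                    + "country CHAR(2), org VARCHAR(128))");
        }
    }
}
```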
VI Selenium, a crawling tool
As stated earlier, Selenium was used as the tool for data accumulation. Selenium is a set of different software tools, each with a different approach to supporting test automation [5]. There are four main variants of Selenium: IDE, Remote Control (RC), WebDriver and Selenium Grid. Selenium WebDriver was selected for several reasons. It is flexible, since it can be used with Java, unlike the IDE. Selenium Remote Control is deprecated and has been replaced by WebDriver. Finally, Selenium Grid has many functions that were not necessary for this research.
VII Dataset Improvement
A. Metadata Aggregation
The aggregated metadata mentioned previously was first obtained in string form and later separated into words by specific separators depending on the language. For example, the Thai dataset uses
Table 3 and Table 4 show the five most frequent words extracted from the keyword meta tags of the seed pages, together with their frequencies and calculated scores, for the English and Thai datasets respectively.
C. Crawling Process
Next, the two highest-scoring keywords were used for searching on Google. This process was done automatically through Selenium. First, the keywords were entered into Google. Then, all the URLs from every page of Google's results were collected and recorded in the database. Lastly, the meta descriptions and keywords extracted from each URL were stored in the database as before.
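A hedged sketch of that collection step is shown below. The query string, the browser driver and the CSS selector for result links are all assumptions, since Google's result markup changes over time.

```java
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class GoogleResultCollector {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver(); // browser choice is an assumption
        // The two highest-scoring keywords joined into one query (placeholder).
        driver.get("https://www.google.com/search?q=free+live+streaming");

        // Collect outbound links from the results container; the selector is
        // an assumption about Google's markup.
        List<WebElement> anchors = driver.findElements(By.cssSelector("#search a"));
        for (WebElement a : anchors) {
            String href = a.getAttribute("href");
            if (href != null && href.startsWith("http") && !href.contains("google.")) {
                System.out.println(href); // recorded into the database in the real crawler
            }
        }
        driver.quit();
    }
}
```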
D. Score Assignment
To calculate the score, the scores of the metadata from the seed pages were used. Each word from a newly aggregated page that matched the collected metadata in the database was assigned the corresponding score. Finally, the keyword scores were summed into a final score for every crawled URL. All of these steps were performed automatically through MySQL Workbench.
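One way to express that summation in SQL, run here through JDBC, is sketched below; the crawled_word and seed_score tables are hypothetical stand-ins for the actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FinalScore {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/flis_english", "root", "")) {
            // Hypothetical tables: crawled_word(url, word) holds words extracted
            // from each crawled page; seed_score(word, score) holds the
            // per-word scores computed from the seed pages.
            String sql = "SELECT cw.url, SUM(ss.score) AS final_score "
                    + "FROM crawled_word cw JOIN seed_score ss ON cw.word = ss.word "
                    + "GROUP BY cw.url ORDER BY final_score DESC";
            try (PreparedStatement ps = db.prepareStatement(sql);
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("url") + " -> " + rs.getInt("final_score"));
                }
            }
        }
    }
}
```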
Table 3: Calculated scores of words extracted from the keyword meta tags of the English dataset
Table 4: Calculated scores of words extracted from the keyword meta tags of the Thai dataset
B. HTTP Security Header
One popular way to improve website security is to use HTTP Security Headers [8].
Table 8 shows the percentage of HTTP Security Header usage in the English and Thai datasets. From our observations, the HTTP Security Headers that FLIS websites most commonly implement are X-Frame-Options, X-XSS-Protection and X-Content-Type-Options, in that order. Moreover, the results show that each website always uses the same configuration [9], as shown in the 'Setting' column of Table 8.
From the gathered data, it can be concluded that the Thai dataset has a higher rate of HTTP Security Header implementation than the English one.
Figure 2: An example FLIS webpage from the English dataset, taken from the top of Google.co.jp's results: http://vumoo.at.
Figure 3: An example FLIS webpage from the Thai dataset, which contains overlay ads, malicious popups and advertisements: https://www.nungmovies-hd.com.
C. HTTPS Usage
The Hyper Text Transfer Protocol Secure (HTTPS) [11] usage test was conducted by checking ports and redirections. Figure 4 shows the response codes [12] from connecting to each URL in the English and Thai datasets. The x-axis represents the port number responding to each URL connection, while the y-axis shows the frequency of each port in percentage. The results show that in both datasets there is neither any connection to HTTPS on port 443 [13] nor any redirection of the webpage, which clearly means that none of the FLIS websites used HTTPS.
D. Location
By investigating the IP address of every URL in the datasets, the location of each website could be retrieved. Figure 5 and Figure 6 show the results. In both figures, the country code of each link is indicated on the x-axis and the frequency in percentage on the y-axis. The graphs show that the majority of FLIS websites are located in the US (United States of America), which is probably caused by the location of the organization each website depends on, discussed in the next section.
Figure 4: The response codes from the redirection test on the English and Thai datasets
Table 8: HTTP Security Header settings and implementation percentages for the English and Thai datasets
E. Organization
Table 9 and Table 10 list the organizations that the URLs in the datasets depend on, with their frequencies in percentages. According to the results, CloudFlare, Inc. has the highest frequency in both datasets.
Figure 5: The country code where each URL is located and its frequency in percentage, for the English dataset
Figure 6: The country code where each URL is located and its frequency in percentage, for the Thai dataset
CloudFlare [14] is one of the most popular organizations that websites depend on, because it can speed up the relying websites and also provides some security support.
As mentioned in the previous section, the majority of websites are located in the US due to the location of CloudFlare, Inc. Therefore, using Domaintools [15], an attempt was made to discover the location of each domain registrant; however, only some of them could be disclosed. In Figure 7 and Figure 8, the x-axis indicates the country code of each URL behind CloudFlare, Inc., while the y-axis shows the frequency in percentages.
Although the US still has the highest frequency, some countries such as PA (Panama) and AU (Australia) were also popular locations hidden behind the organization.
Table 9: Organizations that each URL in the English dataset relies on, with frequencies in percentages
Table 10: Organizations that each URL in the Thai dataset relies on, with frequencies in percentages
IX Conclusion
FLIS websites contain high numbers of iframes and images. Regarding the HTTP Security Headers, the most commonly used are X-Frame-Options, X-XSS-Protection and X-Content-Type-Options, in that order, and the Thai dataset has a higher rate of HTTP Security Header implementation than the English one. However, none of the websites implemented HTTPS. Many of the websites from both datasets were located in the US under CloudFlare, Inc.
Considering all the investigations, FLIS websites probably do not aim to attack users, as people might assume. However, further research could examine this kind of service more closely with regard to overlay ads, popups and advertisements, which could bring about malicious issues.
In a nutshell, FLIS websites do not differ much in location, and the majority of the content in FLIS services is not malicious.
X Problems
Nonetheless, a number of problems were encountered. Regarding the accumulation of web pages, searching for 100 non-redundant seed pages on Google was not an easy task, since Google generates its results to maximize recall and precision. In addition, different websites use different coding styles, which sometimes caused bugs in the program. For example, a number of websites used quotation marks inside crawled attributes, which broke the MySQL statements. Moreover, neither Eclipse nor MySQL supports the Thai language by default, so extra configuration was needed. Besides, time and the Internet connection became essential factors in collecting all the data, because some websites took a long time to download due to their large numbers of images. Additionally, some crawled websites unexpectedly stopped working and became inaccessible after aggregation, which might affect the crawling results. Furthermore, the results from the two datasets cannot be compared precisely, since the numbers of URLs obtained from the automated crawls of the two datasets are not equal.
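For the quotation-mark and Thai-encoding problems in particular, the standard JDBC remedy is a parameterized statement over a UTF-8 connection; the sketch below assumes a hypothetical metadata table.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class SafeInsert {
    public static void main(String[] args) throws Exception {
        // An attribute value containing quotation marks, which breaks
        // naively concatenated SQL strings.
        String description = "Watch \"free\" movies online";

        // characterEncoding=utf8 lets the driver carry Thai text correctly;
        // the database and table names are hypothetical.
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/flis_thai?useUnicode=true&characterEncoding=utf8",
                "root", "");
             PreparedStatement ps = db.prepareStatement(
                     "INSERT INTO metadata (url, description) VALUES (?, ?)")) {
            ps.setString(1, "http://www.example.com"); // placeholder URL
            ps.setString(2, description); // quotes are escaped by the driver
            ps.executeUpdate();
        }
    }
}
```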
XI Tradeoffs
The automatic crawler depends on the capability of Google, since it uses the search engine as its data accumulation tool. Besides, because it relies on the metadata of the original datasets, any improvement of the crawler always depends on the existing data.
XII Acknowledgements
The author would like to thank Assistant Professor Doudou Fall for his kind advice and guidance throughout the research.