FLIS Service Investigation
Shotirose Poramesanaporn
Nara Institute of Science and Technology
Mahidol University
	
I Abstract
Free Live Streaming (FLIS) services have become extremely popular nowadays; therefore, the security of the service should be examined. This paper presents security-focused findings about the FLIS service, drawn from an investigation of two distinct datasets – English and Thai. The aim of the research was to analyze the security of FLIS websites by investigating the sites and differentiating them according to the location in which the service operates.
	
II Introduction
As the Internet has become worldwide, the number of Internet services has increased immensely. One of the popular services is FLIS. FLIS, or Free Live Streaming, is a service that provides free viewing of video content to any user. In fact, the broadcast signal is typically distributed without the consent of the content owner, even though the copyright of the video content might cost more than a billion; as a result, the service is generally considered illegal.
	
Even though this kind of service is illegal, one might wonder why the number of service providers is so large. The characteristics of the service should therefore be determined. Moreover, FLIS services operate all around the world under various providers, so whether there is any difference depending on the area of implementation should be clarified. On that basis, the security of the service could be analyzed.
	
Basically, the research primarily depended on two research papers: “It’s Free for a Reason: Exploring the Ecosystem of Free Live Streaming Services” [1] and “Large-scale Security Analysis of the Web: Challenges and Findings” [2].
	
The first paper mainly served as a reference for a website crawling method using Selenium, a tool for automating web applications for testing purposes. The tool can be used with several languages, for example Python, C#, or Java.
	
The second paper was primarily used to understand website security. There are numerous security mechanisms a website can use in order to protect itself along with its users.

A. Inside the FLIS Service
In order to understand the service clearly, it is important to examine its components and the way it works. As shown in Figure 1, there are two main components in the service: a channel provider and an aggregator. Firstly, the channel provider works as a live signal receiver and broadcaster. Next, the aggregator is a page that contains links to live signals. When a user clicks on any provided link on the aggregator page, the user is redirected to a video player covered by ads. Afterwards, the money, which comes from Click-Per-Thousand (CPT), Click-Per-Rate (CPR), Cost-Per-Click (CPC), or clicks on fake close buttons, flows to the ad network, the aggregator, and the channel provider respectively.

Figure 1: Components of the FLIS service, from “It’s Free for a Reason: Exploring the Ecosystem of Free Live Streaming Services” [1]
	
III Datasets
There are two separate datasets, because the web pages were gathered through www.google.com using English and Thai keywords respectively. The process was done manually so that the lowest possible number of false positives could be achieved. Each dataset contained 100 URLs, and each URL became a seed page.
	
IV Crawler
The crawler was written in Java using Eclipse, and the program was connected to MySQL through XAMPP. For each website, the crawler kept data on links, iframes, and images; the HTTP security headers, composed of X-Xss-Protection, X-Frame-Options, Content-Security-Policy, X-Content-Type-Options, Strict-Transport-Security, and Public-Key-Pins; HTTPS usage; the website’s location (IP address, hostname, city, region, country, coordinates, organization name, and postal code); the metadata; and the vulnerability status of each website.
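
As a rough illustration of how one such record might be stored, the sketch below inserts a single row into a local XAMPP MySQL instance; the database, table, and column names are hypothetical, not the ones from the actual crawler.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Hypothetical sketch: persisting one crawled record into MySQL (via XAMPP).
public class CrawlStore {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/flis?useUnicode=true&characterEncoding=UTF-8",
                "root", "")) {
            String sql = "INSERT INTO pages (url, links, iframes, images, "
                       + "x_frame_options, uses_https, country) VALUES (?,?,?,?,?,?,?)";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, "http://example-flis-site.com"); // placeholder URL
                ps.setInt(2, 120);             // number of links found
                ps.setInt(3, 2);               // number of iframes found
                ps.setInt(4, 85);              // number of images found
                ps.setString(5, "SAMEORIGIN"); // X-Frame-Options value, if present
                ps.setBoolean(6, false);       // HTTPS usage
                ps.setString(7, "US");         // country code from ipinfo.io
                ps.executeUpdate();
            }
        }
    }
}
```

Using a PreparedStatement also escapes quotation marks inside attribute values, which sidesteps the MySQL problem described in Section X; the UTF-8 connection options relate to the Thai-language configuration issue mentioned there.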
	
V Data Aggregation
In order to collect data about links, iframes, and images, Selenium was used as a tool. The purpose of aggregating these data was to understand the characteristics of FLIS websites.
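
For reference, a minimal Selenium WebDriver sketch of this kind of collection might look as follows; the browser choice and target URL are placeholders, not details taken from the actual crawler.

```java
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

// Minimal sketch: counting links, iframes, and images on one page.
public class PageStats {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver(); // placeholder browser setup
        try {
            driver.get("http://example-flis-site.com"); // placeholder URL
            List<WebElement> links   = driver.findElements(By.tagName("a"));
            List<WebElement> iframes = driver.findElements(By.tagName("iframe"));
            List<WebElement> images  = driver.findElements(By.tagName("img"));
            System.out.printf("links=%d iframes=%d images=%d%n",
                    links.size(), iframes.size(), images.size());
        } finally {
            driver.quit(); // always release the browser
        }
    }
}
```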
	
For the HTTP security headers, thanks to the OWASP Secure Headers Project, the data could be retrieved through the terminal with the command `curl -L -A "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36" -s -D - (https://www.example.com) -o /dev/null` [3].
	
Similarly, for the website’s location, the data could be obtained through the terminal with the command `curl ipinfo.io/(website’s IP address)` [4].
	
Additionally, HTTPS usage was determined by checking the connecting port together with any redirection of the webpage. Since the URLs in the datasets contained only "http://", any page that used HTTPS would have to trigger a redirection.
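
A rough sketch of that check, assuming the redirect is detected by inspecting the Location header of the first response (the URL is a placeholder):

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative sketch: does an http:// URL redirect to https://?
public class HttpsCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example-flis-site.com"); // placeholder URL
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setInstanceFollowRedirects(false); // inspect the redirect ourselves
        int code = con.getResponseCode();      // e.g. 200, 301, or 302
        String location = con.getHeaderField("Location");
        boolean toHttps = location != null && location.startsWith("https://");
        System.out.println("status=" + code + ", redirectsToHttps=" + toHttps);
        con.disconnect();
    }
}
```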
	
Furthermore, for the purpose of automatic dataset improvement, the metadata in the description and keywords tags of each website was recorded via Selenium WebDriver. Each obtained sentence was cut into words differently depending on the language structure. To improve precision, regular expressions were applied together with manual analysis. The assembled words were then counted and given a score according to their frequency.
	
Besides, the vulnerability of each website was checked individually through the Google Transparency Report and recorded.
	
Finally, all data was collected in two separate databases, one per dataset. Each database was divided into several tables in order to reduce data redundancy.
	
VI Selenium, a crawling tool
As stated earlier, Selenium was used as the tool for data accumulation. Selenium is a set of different software tools, each with a different approach to supporting test automation [5]. Generally, there are four main Selenium components: IDE, Remote Control (RC), WebDriver, and Selenium Grid. Selenium WebDriver was selected for several reasons. Essentially, it is flexible, because unlike the IDE it can be used from Java. Next, Selenium Remote Control has been deprecated and replaced by WebDriver. Furthermore, Selenium Grid offers many functions that were definitely not necessary in this research.
	
VII	Dataset	Improvement	
A.	Metadata	Aggregation	
The aggregated metadata mentioned previously was first obtained in string form and later separated into words by specific separators depending on the language: the Thai dataset uses the space character ( ) while the English one uses the comma (,). However, some words also contain special characters, e.g. ! or ., which might reduce the accuracy of further analysis; therefore, a regular expression (regex or regexp for short), a special text string for describing a search pattern [6], was applied. After that, each word was manually checked.
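
A minimal sketch of this cleanup step follows; the exact pattern used in the crawler is not documented, so the character class below, which keeps letters (including Thai) and digits and drops punctuation, is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: split a metadata string and strip special characters.
public class MetaWords {
    public static List<String> clean(String meta, String separator) {
        List<String> words = new ArrayList<>();
        for (String raw : meta.split(separator)) {
            // Keep letters (Unicode, so Thai works) and digits; drop e.g. ! or .
            String w = raw.replaceAll("[^\\p{L}\\p{Nd}\\s]", "").trim();
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // English example with comma-separated keywords:
        System.out.println(clean("watch movies online, free movies!, hd.", ","));
        // -> [watch movies online, free movies, hd]
    }
}
```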
	
B. Score Rating
Next, each keyword from the datasets was assigned a score, so that the dataset could be automatically improved in further processes. To assign the score for each word, the 10 most frequent description words and the 5 most frequent keywords were computed. The calculation summed all frequency counts, excluding stop words and website names. The frequency of each word was then divided by the total frequency and multiplied by 100:

Score for each word = Frequency / Total frequency * 100
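
Expressed as code, the score is simply a normalized frequency. The sketch below mirrors the formula above with hypothetical counts.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: score = frequency / total frequency * 100.
public class WordScores {
    public static Map<String, Double> score(Map<String, Integer> freq) {
        int total = freq.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> scores = new HashMap<>();
        freq.forEach((word, f) -> scores.put(word, 100.0 * f / total));
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        freq.put("movies", 40); // hypothetical frequency counts
        freq.put("online", 35);
        freq.put("free", 25);
        System.out.println(score(freq)); // e.g. {movies=40.0, online=35.0, free=25.0}
    }
}
```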
	
Table 1 and Table 2 show the 10 most frequent words extracted from the description meta tag of the seed pages, together with the frequency and calculated score of each word, for the English and Thai datasets respectively.
	
Table	1:	Calculated	scores	of	extracted	words	from	description	meta	tag	of	English	dataset	
Table	2:	Calculated	scores	of	extracted	words	from	description	meta	tag	of	Thai	dataset
Table 3 and Table 4 show the 5 most frequent words extracted from the keywords meta tag of the seed pages, together with the frequency and calculated score of each word, for the English and Thai datasets respectively.
	
C. Crawling Process
Later, the two highest-scoring keywords were used for searching in Google. This process was done automatically through Selenium. At the beginning, the keywords were entered into Google. Then, all the URLs from every page of the Google results were collected and recorded in the database. Lastly, the meta descriptions and keywords extracted from each URL were kept in the database as before.
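
A stripped-down sketch of that crawl step is shown below; the query keywords and the CSS selector for result links are assumptions, since Google's result markup changes over time.

```java
import java.net.URLEncoder;
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

// Illustrative sketch: collect result URLs for one keyword query.
public class GoogleCrawl {
    public static void main(String[] args) throws Exception {
        WebDriver driver = new FirefoxDriver(); // placeholder browser setup
        try {
            String q = URLEncoder.encode("watch movies online", "UTF-8"); // hypothetical keywords
            driver.get("https://www.google.com/search?q=" + q);
            // "h3 > a" matched result links at one time; treat it as an assumption.
            List<WebElement> results = driver.findElements(By.cssSelector("h3 > a"));
            for (WebElement r : results) {
                System.out.println(r.getAttribute("href")); // to be recorded in the database
            }
        } finally {
            driver.quit();
        }
    }
}
```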
	
D. Score Assignment
In order to calculate the score of each newly crawled page, the word scores obtained from the seed pages’ metadata were used. Words from newly aggregated pages that matched the collected metadata in the database were assigned the corresponding scores individually. At last, the keyword scores were summed up into a final score for every crawled URL. All of the mentioned processes were executed automatically through MySQL Workbench.
Table	3:	Calculated	scores	of	extracted	words	from	keyword	meta	tag	of	English	dataset	
Table	4:	Calculated	scores	of	extracted	words	from	keyword	meta	tag	of	Thai	dataset
E. Fault Reduction
After obtaining a total score for each URL, a threshold was applied in order to maintain accuracy. Any page with a final score greater than 50 was manually checked again before being added to the dataset. The value 50 was chosen as the threshold because it produced a reasonable quantity of URL results with an acceptable number of false positives.
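
Combining the scoring and threshold steps above, a final score could be computed and filtered roughly as follows; the seed-word score map comes from the earlier scoring step, and all names are illustrative.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch: sum seed-word scores found on a crawled page,
// then keep the page for manual checking only if the total exceeds 50.
public class PageScorer {
    static final double THRESHOLD = 50.0;

    static double finalScore(List<String> pageWords, Map<String, Double> seedScores) {
        double total = 0.0;
        for (String w : pageWords) {
            total += seedScores.getOrDefault(w, 0.0); // unmatched words add nothing
        }
        return total;
    }

    static boolean keepForManualCheck(List<String> pageWords,
                                      Map<String, Double> seedScores) {
        return finalScore(pageWords, seedScores) > THRESHOLD;
    }
}
```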
	
Table 5 and Table 6 show the total number of new pages that were automatically crawled from Google. At the beginning, the ‘Total list from Google’, derived by entering the two most frequent description keywords into Google, was recorded. After grading each word, the final score of each URL was obtained, the threshold was applied, and the result was recorded as ‘Total list which was scored more than 50’. Afterwards, any URL that was exactly the same as a seed page was removed, giving the ‘Total list after removed redundancy’. Finally, each URL was checked manually, and the total number was noted in the table as ‘Total list after manually check’.
	
	
	
	
	
	
VIII Research Result
A. General Characteristics
According to Table 7, an FLIS service page normally contains a high number of links and images but a low number of iframes (inline frames) [7]. This is because this kind of website is always composed of movie posters containing links to illegal videos, sometimes accompanied by overlay ads, malicious popups, and advertisements, as shown in Figure 2 and Figure 3.
	
	
	
Table	5:	Number	of	new	added	lists	in	each	process	from	automated	crawling	through	Google	of	English	dataset	
Table	6:	Number	of	new	added	lists	in	each	process	from	automated	crawling	through	Google	of	Thai	dataset	
Table	7:	Average	number	of	links,	iframes	and	images	from	English	and	Thai	dataset
B. HTTP Security Header
One of the popular ways to implement website security is to use HTTP security headers [8]. Table 8 shows HTTP security header usage in the English and Thai datasets, in percentages. From observation, the HTTP security headers that FLIS websites most commonly implemented were X-Frame-Options, X-Xss-Protection, and X-Content-Type-Options, in that order. Moreover, the results show that websites almost always use the same configuration [9] for a given header, as shown in the ‘Setting’ column of Table 8.

From the gathered data, it can be concluded that the Thai dataset has higher HTTP security header implementation than the English one.
Figure	2:	An	example	of	FLIS	service	webpage	of	English	dataset	from	the	top	list	of	Google.co.jp,	http://vumoo.at.	
Figure	3:	An	example	of	FLIS	service	webpage	of	Thai	dataset,	which	contains	overlay	ads,	malicious	popup	and	
advertisements,	https://www.nungmovies-hd.com.
C. HTTPS Usage
A Hyper Text Transfer Protocol Secure (HTTPS) [11] usage test was conducted by checking ports and redirection. Figure 4 shows the response codes [12] from connecting to each URL in the English and Thai datasets. The x-axis represents the response port number from each URL connection, while the y-axis shows the frequency of the particular port in percentages. The result shows that in both datasets there was neither any connection to HTTPS on port 443 [13] nor any redirection of the webpage, which clearly means that none of the FLIS websites used HTTPS.
	
D. Location
By investigating the IP address of every URL in the datasets, the location of each website could be retrieved. Figure 5 and Figure 6 show the results. In the figures, the country code of each link is indicated on the x-axis, and the frequency in percentages is represented on the y-axis. According to the graphs, the majority of FLIS websites are located in the US (United States of America), which is probably caused by the location of the organization that the websites depend on, discussed in the next section.
	
	
	
*	[10]	
Figure 4: The response codes from the redirection test of the English and Thai datasets
Table	8:	HTTP	security	header	settings	and	implementations	in	percentage	of	English	and	Thai	dataset
	
	
E. Organization
Table 9 and Table 10 show the lists of organizations that the URLs in the datasets rely on, together with their frequencies in percentages. According to the results, CloudFlare, Inc. has the highest frequency in both datasets.
Figure	5:	A	country	code	that	each	URL	locates	and	frequencies	in	percentages	of	English	dataset	
Figure	6:	A	country	code	that	each	URL	locates	and	frequencies	in	percentages	of	Thai	dataset
CloudFlare [14] is one of the most popular organizations that websites rely on. This is because the organization can speed up the websites that rely on it and also provides some security support.
	
As mentioned in the previous section, the majority of websites appear to be located in the US because of the location of CloudFlare, Inc. Therefore, using Domaintools [15], an attempt was made to discover the location of each domain registrant. However, only some of them could be disclosed. In Figure 7 and Figure 8, the x-axis indicates the country code of each URL behind CloudFlare, Inc., while the y-axis shows the frequency in percentages.

Although the US still has the highest frequency, some other countries, such as PA (Panama) and AU (Australia), were also popular locations hidden behind the organization.
	
Table	9:	A	list	of	organizations	and	frequencies	in	percentages	that	each	URL	in	English	dataset	relies	on	
Table 10: A list of organizations and frequencies in percentage that each URL in Thai dataset relies on
F. Vulnerabilities
From searching for vulnerability issues through the Google Transparency Report, as shown in Figure 9, most URLs in the English and Thai datasets were reported as not dangerous. Only a few of them were flagged as malicious owing to deceptive content.
Figure 7: Locations of domain registrants behind the Cloudflare organization, represented by country code, with frequencies in percentages for the English dataset
Figure 8: Locations of domain registrants behind the Cloudflare organization, represented by country code, with frequencies in percentages for the Thai dataset
Figure	9:	Vulnerability	status	of	English	and	Thai	dataset	with	frequencies	obtained	through	Google	Transparency	Report
IX Conclusion
An FLIS service website contains a high number of links and images. According to the research on HTTP security headers, the popular ones are X-Frame-Options, X-Xss-Protection, and X-Content-Type-Options, in that order; notably, the Thai dataset has higher HTTP security header implementation than the English one. However, none of the websites implemented HTTPS. Many of the websites from both datasets were located in the US behind CloudFlare, Inc.
	
Considering all the investigations, FLIS websites probably do not aim to attack users, contrary to what people might assume. However, in further research this kind of service could be examined more closely with regard to overlay ads, popups, and advertisements, which could bring about some malicious issues.
	
In a nutshell, FLIS websites do not differ much by location, and the majority of content in the FLIS service is not malicious.
	
X Problems
Nonetheless, a number of problems were encountered. Regarding the accumulation of web pages, searching for 100 seed pages without redundancy through Google was not an easy task, since Google generates its results for the highest recall and precision. In addition, different websites were coded differently, which sometimes caused bugs in the program. For example, a number of websites used quotation marks inside crawled attributes, which broke the MySQL insert statements. Moreover, neither Eclipse nor MySQL supports Thai by default, so extra configuration had to be done. Besides, time and the Internet connection became essential factors in collecting all the data, because some websites needed a long time to download as a result of a large number of images. Additionally, some crawled websites stopped working and became unexpectedly inaccessible after aggregation, which might affect the crawling result. Furthermore, the results from the two datasets cannot be compared precisely, since the numbers of automatically crawled pages are not equal.
	
XI Tradeoffs
The automatic crawler depends on the capability of Google, since it uses the search engine as its tool for data accumulation. Besides, as a result of using the metadata of the original datasets, the improvement of the crawler always depends on the existing data.
	
XII Acknowledgements
The author would like to thank Assistant Professor Doudou Fall for his kind advice and guidance throughout the research.
Works Cited
[1] [Online]. https://www.securitee.org/files/flis_ndss16.pdf
[2] [Online]. https://tom.vg/papers/eusec_trust2014.pdf
[3] OWASP. (2016, June) Welcome to OWASP. [Online]. https://www.owasp.org
[4] ipinfo.io. [Online]. http://ipinfo.io
[5] SeleniumHQ. [Online]. http://www.seleniumhq.org
[6] (2016, July) Regular-Expressions.info. [Online]. http://www.regular-expressions.info
[7] w3schools. w3schools. [Online]. http://www.w3schools.com/html/html_iframe.asp
[8] Dionach. (2014, September) Dionach. [Online]. https://www.dionach.com/blog/an-overview-
of-http-security-headers
[9] Isaac Dawson. (2014, March) Veracode. [Online].
https://www.veracode.com/blog/2014/03/guidelines-for-setting-security-headers
[10] Foundeo, Inc. (2012-2016) Content Security Policy (CSP) Quick Reference Guide. [Online].
https://content-security-policy.com
[11] Comodo. (2016) Instant SSL. [Online]. https://www.instantssl.com/ssl-certificate-
products/https.html
[12] R. Fielding et al. RFC 2616, Section 10: Status Code Definitions. [Online].
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
[13] Python Software Foundation. (2016, June) HTTP protocol client. [Online].
https://docs.python.org/3/library/http.client.html
[14] Cloudflare. [Online]. https://www.cloudflare.com
[15] Domaintools. [Online]. http://whois.domaintools.com/couchtuner.ag