Deep Web and Digital Investigations
1. Deep Web and Digital Investigations
Damir Delija
Milano 2014
2. What we will talk about
• Web and “Deep Web”
• Web and documents
• Definitions
• Technical issues
• Forensic issues
• I’m not an expert on deep or dark web
• Discussion based on many sources and
references
3. Inaccessible Web
• Deep Web is a name for data inaccessible by
regular search engines on the Internet
• Deep Web sounds much better than
inaccessible
• Searchable / Accessible web is also called
surface web
• The Dark Web is the part of the WWW with illegal or immoral content
• The Dark Web is not the same as the Deep Web; it is a part of it, although dark pages exist on the surface web too
4. Inaccessible Resources
• Inaccessible resources
– they exist, but we do not know about them or their location
– we cannot use them
• It is an old problem
– you have it, even in your own room
• Is there any solution?
– an idea from the Gopher days: Veronica
– it works well with static pages and data
– it was abandoned in the web era; indexing then became a source of tremendous power and wealth for search engines
5. Web and Internet and Documents
• The WWW is not the Internet ☺
– likewise, the full data and document space of each networked computer is not part of the Internet
• The WWW is a hypertext, document-based structure
– there are links among documents
– a document is not necessarily a web page
– documents must have a presentation ability to be visible through the web interface (a transcription layer, often dynamically generated)
– links, web pages and documents can be static or dynamically generated
– dynamic documents exist because of the volume of data (it cannot be organised in static pages)
Definitions are crucial in understanding the deep and surface web
6. Volume of Data
• For each document there are, on average, 11 copies in the system
– pre-SAN enterprise measurements
• Shows how the document space expands rapidly
• Even a simple e-mail can cause a data avalanche
• From the surface web point of view?
• Mostly invisible
• From the Deep Web point of view?
• Copies of data and documents are probably floating around, inaccessible to us
7. Web and Search Engines
• The web can only reach material that is referenced by a link and is not access-protected
• Today we mostly assume that search engine coverage equals the web and the Internet
• To be effective, search engines must have pre-organised data (an index) to answer queries
• The collected data is enormous, constantly changing, and subject to propagation lag
http://en.wikipedia.org/wiki/List_of_search_engines
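A minimal sketch of what “pre-organised data” means in practice: a toy inverted index that maps words to documents, so a query can be answered without re-reading every page. The document names and texts are illustrative only.

# Toy inverted index: the kind of pre-organised structure a search
# engine needs in order to answer queries without re-reading every page.
from collections import defaultdict

docs = {
    "page1": "deep web resources hidden behind query forms",      # illustrative
    "page2": "surface web pages reachable by following links",    # illustrative
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

print(sorted(index["web"]))      # ['page1', 'page2']
print(sorted(index["forms"]))    # ['page1']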
8. Deep Resources
• The Deep Web depends on how search engines acquire and store data
• The web can be crawled or explored as a link space
• Hints come from caches, proxies and protocol traffic
• There is no clear boundary between deep resources and surface resources
9. Uncollectible Resources
Deep Web Resources
• Dynamic web pages
– returned in response to a query or accessible only through a form
• Unlinked content
– pages without any backlinks
• Private web
– sites requiring registration and login (password-protected resources)
• Limited-access web
– sites with CAPTCHAs or no-cache Pragma HTTP headers
• Scripted pages
– pages produced by JavaScript, Flash, AJAX, etc.
• Non-HTML content
– multimedia files, e.g. images or videos
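A minimal sketch of why these resources stay invisible: a naive crawler only queues what appears in <a href> attributes, so anything behind a form, script or login never enters its queue. The start URL is a placeholder; only Python standard-library modules are used.

# Naive link-following crawler: it only discovers what is linked with
# <a href>, so form results, scripted pages and login-protected content
# never enter the queue.  "http://example.org" is a placeholder start page.
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start, limit=10):
    seen, queue = set(), [start]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", "replace")
        except OSError:
            continue
        parser = LinkCollector()
        parser.feed(html)
        queue.extend(urljoin(url, link) for link in parser.links)
    return seen

print(crawl("http://example.org"))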
10. Uncollectible Resources
Documents and Disk Space
• This comes close to the e-discovery field
• Is this part of the Deep Web?
• Documents not in the web tree
– accessible only by direct filesystem access
– or by a dedicated scripting effort
• Files generally sit on web-server and non-web-server machines
– accessible only by direct filesystem access
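A hedged sketch of the same idea on the server side: walk the filesystem and flag files that sit outside the published document root, i.e. files reachable only by direct filesystem access. Both paths are assumptions and differ per server.

# Sketch: list files that live outside the published web tree and are
# therefore reachable only by direct filesystem access (or a dedicated
# script), not through the web.  Both paths below are assumptions.
import os

DOC_ROOT = "/var/www/html"      # assumed published web tree
SEARCH_ROOT = "/var/www"        # assumed wider area on the same host

for dirpath, dirnames, filenames in os.walk(SEARCH_ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        if not path.startswith(DOC_ROOT + os.sep):
            print(path)          # candidate "deep" file outside the web tree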
11. Forgotten Data
• From the security perspective, forgotten data is a very interesting part of the Deep Web
• What is forgotten data? Perhaps data without a custodian?
• Verizon's data breach report from 2008 found
– unknown data was part of the breach in 66% of incidents
12. Data Lifecycle
• Data creation and circulation
• How to find data and correlate it
• Search engines
• Proxies
• Metadata, logs, feeds
• Very interesting ideas in “Programming Collective Intelligence” by Toby Segaran, O'Reilly Media, 2007
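One hedged illustration of “find data and correlate it”: counting which URLs appear in a proxy log shows what is actually in circulation, whether or not any search engine indexes it. The file name and the whitespace-separated layout (URL in the seventh field, as in Squid's default format) are assumptions.

# Sketch: correlate proxy log entries to see which resources circulate,
# regardless of whether a search engine ever indexed them.  "proxy.log"
# and the field layout are assumptions about the log format.
from collections import Counter

hits = Counter()
with open("proxy.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        fields = line.split()
        if len(fields) > 6:
            hits[fields[6]] += 1      # assumed position of the URL field

for url, count in hits.most_common(10):
    print(count, url)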
13. Hidden Data in Surface web ?
• The web handles data available through HTML and its extensions
• What about metadata and embedded data that are not accessible to search engines?
14. Surface Web and Deep Issues
• “Hidden Data in Internet Published Documents”
– deep forensic impact
• Specific data formats can have embedded elements that are not visible to search engines
– such as thumbnail views embedded in pictures
– EXIF data in images
– metadata in documents
– steganographic content (stego)
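A minimal sketch of pulling such embedded elements out of an image with the Pillow library; “photo.jpg” is a placeholder. EXIF tags such as the camera model or GPS position never appear in an ordinary web search, yet can matter forensically.

# Sketch: read EXIF metadata embedded in an image file with Pillow.
# "photo.jpg" is a placeholder; install the library with: pip install Pillow
from PIL import Image
from PIL.ExifTags import TAGS

with Image.open("photo.jpg") as img:
    for tag_id, value in img.getexif().items():
        name = TAGS.get(tag_id, tag_id)   # map the numeric tag to a readable name
        print(f"{name}: {value}")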
15. Idea of Treasure Island
• What is not on the map is unknown
• Hidden, like a treasure island
• The idea of the unexplored and uncharted, with big gains to be made
• Because of its size, also the image of an iceberg
16. Why Deep Web Exists ?
• Why do search engines fail?
– technology
• Most web data is behind dynamically generated pages (web gateways)
– web crawlers cannot reach them, or the data is not announced
– the data can only be obtained with access to the system containing the information
– forms have to be populated with values
– this requires understanding the semantics of the web gateway and the data behind it
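A hedged sketch of what “forms have to be populated with values” means: the result page only materialises once a query is submitted, which a link-following crawler never does. The URL and field names are illustrative, and the requests library is assumed to be installed.

# Sketch: content behind a web gateway only exists once the form is
# submitted with concrete values.  The URL and field names are
# illustrative; install the library with: pip install requests
import requests

response = requests.post(
    "http://catalogue.example.org/search",         # assumed search gateway
    data={"query": "deep web", "max_results": 20},  # assumed form fields
    timeout=10,
)
print(response.status_code)
print(response.text[:500])   # start of the dynamically generated result page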
17. Measuring the Deep Web
• How to measure it – estimates are based on known examples
• Try to generate pages based on known home pages and explore the link space, based on hop distances
• First attempt: Bergman (2000)
– the size of the surface web is around 19 TB
– the size of the Deep Web is around 7,500 TB
– the Deep Web is nearly 400 times larger than the Surface Web
• In 2004, Mitesh classified the Deep Web more accurately
– most HTML forms are within two hops of the home page
18. Deep Web Size
Current Estimates (2014)
• Deep Web: about 7,500 terabytes
• Surface Web: about 19 terabytes
• The Deep Web has between 400 and 550 times more public information than the Surface Web
• 95% of the Deep Web is publicly accessible
• More than 200,000 Deep Web sites currently exist
• 550 billion documents on the Deep Web
• 1 billion documents on the Surface Web
19. History of Deep Web
• At the start: static HTML pages that web crawlers can easily reach, and only a few CGI scripts
• In the mid-90s: introduction of dynamic pages, generated as the result of a query or link access
• In 1994: Jill Ellsworth used the term “Invisible Web” to refer to these websites
• In 2001: Bergman coined the term “Deep Web”
• The Dark Web developed in parallel, as crime started to spread over the Internet
20. Rough Timeline
• 2001: Raghavan et al. -> Hidden Web Exposure
– a domain-specific, human-assisted crawler
• 2002: StumbleUpon used human crawlers
– human crawlers can find relevant links that algorithmic crawlers miss
• 2003: Bergman introduced LexiBot
– used for quantifying the Deep Web
• 2004: Yahoo! Content Acquisition Program
– paid inclusion for webmasters
• 2005: Yahoo! Subscriptions
– Yahoo! started searching subscription-only sites
• 2005: Ntoulas et al. -> Hidden Web Crawler
– automatically generated meaningful queries to issue against search forms
• 2005: Google Sitemaps
– allows webmasters to inform search engines about URLs on their websites that are available for crawling (see the sketch after this list)
• Web 2.0 infrastructure
• Today: mobile devices and the Internet of Things
– each gadget can have (and has) a web server for configuration
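For the 2005 sitemap entry above, a hedged sketch of the kind of sitemap.xml a webmaster publishes so crawlers learn about URLs they would not otherwise find; the URLs are placeholders and the file is generated here with Python's standard XML support.

# Sketch: generate a minimal sitemap.xml listing URLs a crawler should
# know about.  The URLs are placeholders.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset = ET.Element("urlset", xmlns=NS)
for page in ("http://example.org/", "http://example.org/reports/2014"):
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = page
    ET.SubElement(url, "changefreq").text = "monthly"

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8",
                             xml_declaration=True)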
22. From Digital Forensic Viewpoint
• Is there a way to carry out forensically sound actions on the Deep Web?
• Can we apply standard digital forensic procedures and best practices?
• In both cases, yes
– we are always limited in digital forensics, but that does not prevent reliable results
23. Web and Digital Forensic
• Web is web ☺
• Web artifacts are web artifacts
• The type of investigation determines how we
handle web data
– the key element is the legal context
• Many possible scenarios and situations
– follow the forensic principles and best practices as
in any other situation
– use the scientific method
– test and experiment to validate the method
24. Deep Web and Forensic Tasks
• How to prove access to Deep Web resources
– the same as for ordinary resources, because access is mostly through browsers
– an advantage over blind Deep Web access, since history, cache and log artifacts show which Deep Web resource was accessed
• Deep Web artifacts
– mostly like any other web artifacts
– Hidden Data in Internet Published Documents
– the Dark Web as a specific subset
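A hedged sketch of using those browser artifacts: Firefox keeps visited URLs in its places.sqlite database (table moz_places), so accessed resources, including Deep Web ones, can be listed from a copied profile. The file path is a placeholder; the query should be run against a forensic copy, never the live profile.

# Sketch: list visited URLs from a copy of a Firefox history database
# (places.sqlite, table moz_places).  The path is a placeholder.
import sqlite3

con = sqlite3.connect("file:places.sqlite?mode=ro", uri=True)   # read-only
rows = con.execute(
    "SELECT url, visit_count, datetime(last_visit_date/1000000, 'unixepoch') "
    "FROM moz_places ORDER BY last_visit_date DESC LIMIT 20"
)
for url, visits, last_visit in rows:
    print(last_visit, visits, url)
con.close()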
25. Forensic Tools Issues
• Forensics of specialised browsers and access tools
– Tor / .onion
– unusual browsers and access tools: links, lynx, wget
– other networks and clients: I2P, Freenet
• Key Question: Does our forensic framework
support such tools?
– Internet Evidence Finder
– EnCase
– FTK
– if not, how do we handle artifacts and data?
• What about mobile devices?
26. Conclusion and Questions
• Challenging field
• Its size will grow with IPv6 adoption and the “Internet of Things” concept
• The cloud concept is important (size, access, legal issues)
• Each new technology will add a new layer of invisibility (complexity)
• The sheer size of available data simply forces the use of dynamic web pages
27. References
Too many links ...
• http://papergirls.wordpress.com/2008/10/07/timeline-deep-web
• http://deepwebtechblog.com/federated-search-finds-content-that-google-can’t-reach-part-i-of-iii
• http://deepwebtechblog.com/a-federated-search-primer-part-ii-of-iii
• http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
• http://www.online-college-blog.com/features/100-useful-tips-and-tools-to-research-the-deep-web/