10 years in the making. How real-world business cases have driven the development of CCC's deep search solutions, leading to the capabilities for web-crawling and delivery of targeted intelligence that helps R&D; intensive companies gain a competitive advantage.
5. 1. 2012-2013: Crawling metasearch results Yahoo.com / Bing.com
2. 2014: Prototyping a configurable crawler framework
3. 2015: Rollout of Company SEARCHCORPUS
5
6. 6
Having started in 2015 with just about 30,000 companies,
the company SEARCHCORPUS keeps growing and growing…
2015 2016 2017 2018 2019
• 30.000 company websites
• Duration 2 weeks:
• Crawling
• Indexing
• 50 GB of web data
• 60.000 company websites
• Still 2 weeks:
• Crawling
• Indexing
• 500 GB of web data
• 290.000 company websites
• Link depth 5
• Still 2 weeks:
• Crawling
• Geolocation
• Classification
• Indexing
• 2 TB of web data
2020 2021 2022
7. 1. 2012-2013: Crawling metasearch results Yahoo.com / Bing.com
2. 2014: Prototyping a configurable crawler framework
3. 2015: Rollout of Company SEARCHCORPUS
4. 2016: Thesaurus management for domain specific content selection
7
9. 1. 2012-2013: Crawling metasearch results Yahoo.com / Bing.com
2. 2014: Prototyping a configurable crawler framework
3. 2015: Rollout of Company SEARCHCORPUS
4. 2016: Thesaurus management for domain specific content selection
5. 2017: Establishing a process to roll out news trackers and other crawling solutions
9
10. 10
……2015 ……………………………..….2016……………….………...………2017……
• Preparation of international rollout of domain specific targeted news trackers and alerting
• Animal Health Tracker
• RBB Tracker
• CRDI Tracker
• BD&L Tracker
• Single Sign On with automatic user provisioning (SAML)
11. 1. 2012-2013: Crawling metasearch results Yahoo.com / Bing.com
2. 2014: Prototyping a configurable crawler framework
3. 2015: Rollout of Company SEARCHCORPUS
4. 2016: Thesaurus management for domain specific content selection
5. 2017: Establishing a process to roll out news trackers and other crawling solutions
6. 2018: Clinical Trial Registry Tracker
11
12. 12
Australia / New Zealand
China
Republic of Korea
USA
Germany
European Union
Hong Kong
International (Springer)
Japan
Philippines
University Hospital Medical Information Network
WHO
13. 1. 2012-2013: Crawling metasearch results Yahoo.com / Bing.com
2. 2014: Prototyping a configurable crawler framework
3. 2015: Rollout of Company SEARCHCORPUS
4. 2016: Thesaurus management for domain specific content selection
5. 2017: Establishing a process to roll out news trackers and other crawling solutions
6. 2018: Clinical Trial Registry Tracker
7. 2018: Machine learning based classification of company websites
13
14. 14
1. To obtain a reasonably sized input vector (remember, we classify a whole website
which may have several 100 MB of content), we convert the data into a vector
using a TF-IDF pre-processor trained on a corpus collected for the project
2. Support Vector Machines alone is not good enough, therefore pre-processing of
all input with a custom thesaurus is necessary
3. For all 6 real world samples we got > 96% average recognition rate
15. 1. 2012-2013: Crawling metasearch results Yahoo.com / Bing.com
2. 2014: Prototyping a configurable crawler framework
3. 2015: Rollout of Company SEARCHCORPUS
4. 2016: Thesaurus management for domain specific content selection
5. 2017: Establishing a process to roll out news trackers and other crawling solutions
6. 2018: Clinical Trial Registry Tracker
7. 2018: Machine learning based classification of company websites
8. 2019: Globally distributed massive parallel crawling
15
16. 16
Transferring 1 page from London: 82ms, 500 pages: 41 seconds, 1.000 servers: 11,5 hours
Transferring 1 page from Tokyo: 1.200ms, 500 pages: 10 minutes, 1.000 servers: 6 days 23 hours
NASA’s
Terra
satellite
for
the
MODIS
imageries,
combined
by
Meow.
Credit:
NASA
Goddard
Space
Flight
Center
Image
by
Reto
Stöckli
(land
surface,
shallow
water,
clouds).
Enhancements
by
Robert
Simmon
(ocean
color,
compositing,
3D
globes,
animation).
Data
and
technical
support:
MODIS
Land
Group;
MODIS
Science
Data
Support
Team;
MODIS
Atmosphere
Group;
MODIS
Ocean
Group
Additional
data:
USGS
EROS
Data
Center
(topography);
USGS
Terrestrial
Remote
Sensing
Flagstaff
Field
Center
(Antarctica);
Defense
Meteorological
Satellite
Program
(city
lights).,
Public
domain,
via
Wikimedia
Commons
17. 1. 2012-2013: Crawling metasearch results Yahoo.com / Bing.com
2. 2014: Prototyping a configurable crawler framework
3. 2015: Rollout of Company SEARCHCORPUS
4. 2016: Thesaurus management for domain specific content selection
5. 2017: Establishing a process to roll out news trackers and other crawling solutions
6. 2018: Clinical Trial Registry Tracker
7. 2018: Machine learning based classification of company websites
8. 2019: Globally distributed massive parallel crawling
9. 2020/21: Deep learning for automated news rating
17
18. 18
Corporate
Websites
News portals
News feeds
Crawlers
{APIs}
{APIs}
3rd party APIs
Licensed 3rd party
content
Feed readers
News archive (un)rated news
Consume
{APIs}
{
standard
API
}
Model + meta data
Request news rating
for selected model
Return news rated with
selected model
Deploy selected
model
Crawl
/
retrieve
news
Rate news with
selected model
extracted news
Publish model + metadata
Optimization of
models, retraining
model rating
Deploy and verify model
19. 1. 2012-2013: Crawling metasearch results Yahoo.com / Bing.com
2. 2014: Prototyping a configurable crawler framework
3. 2015: Rollout of Company SEARCHCORPUS
4. 2016: Thesaurus management for domain specific content selection
5. 2017: Establishing a process to roll out news trackers and other crawling solutions
6. 2018: Clinical Trial Registry Tracker
7. 2018: Machine learning based classification of company websites
8. 2019: Globally distributed massive parallel crawling
9. 2020/21: Deep learning for automated news rating
10. 2022: Automating regulatory intelligence collection and classification
(will be integrated with intranet applications to manage regulatory events)
19