AI-SDV 2022, Oct. 10/11 2022
Klaus Kater
Director, Research & Development
1. 2012-2013: Crawling metasearch results Yahoo.com / Bing.com
2. 2014: Prototyping a configurable crawler framework
[Screenshots: look and feel 2014 vs. look and feel 2022]
3. 2015: Rollout of Company SEARCHCORPUS
Having started in 2015 with just about 30,000 companies,
the company SEARCHCORPUS keeps growing and growing…
Timeline 2015 to 2022, three stages of growth:
• 30,000 company websites
  • Duration 2 weeks: crawling, indexing
  • 50 GB of web data
• 60,000 company websites
  • Still 2 weeks: crawling, indexing
  • 500 GB of web data
• 290,000 company websites
  • Link depth 5
  • Still 2 weeks: crawling, geolocation, classification, indexing
  • 2 TB of web data
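As a rough sanity check on the growth figures above, the following sketch derives the average throughput each stage implies. It assumes the two-week window means 14 days of continuous operation and that the data volumes are decimal units (1 TB = 1,000 GB); neither is stated on the slide, and the stage labels are purely illustrative.

```python
from dataclasses import dataclass

# Assumption: "2 weeks" means 14 full days of uninterrupted processing.
HOURS_PER_RUN = 14 * 24
SECONDS_PER_RUN = HOURS_PER_RUN * 3600


@dataclass(frozen=True)
class Stage:
    websites: int
    data_gb: float  # decimal gigabytes (assumption)


# Website counts and data volumes are taken from the slide.
STAGES = {
    "stage 1": Stage(websites=30_000, data_gb=50),
    "stage 2": Stage(websites=60_000, data_gb=500),
    "stage 3": Stage(websites=290_000, data_gb=2_000),
}


def sites_per_hour(stage):
    """Average number of websites that must be completed per hour."""
    return stage.websites / HOURS_PER_RUN


def mb_per_site(stage):
    """Average amount of web data collected per website, in MB."""
    return stage.data_gb * 1000 / stage.websites


def mb_per_second(stage):
    """Average sustained data rate over the whole run, in MB/s."""
    return stage.data_gb * 1000 / SECONDS_PER_RUN


for name, stage in STAGES.items():
    print(
        f"{name}: {sites_per_hour(stage):.0f} sites/h, "
        f"{mb_per_site(stage):.1f} MB/site, "
        f"{mb_per_second(stage):.2f} MB/s"
    )
```

Because the run length stays fixed at two weeks while the corpus grows, the required sites-per-hour rate scales directly with the number of websites.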
4. 2016: Thesaurus management for domain specific content selection
5. 2017: Establishing a process to roll out news trackers and other crawling solutions
Timeline: 2015, 2016, 2017
• Preparation of international rollout of domain specific targeted news trackers and alerting
• Animal Health Tracker
• RBB Tracker
• CRDI Tracker
• BD&L Tracker
• Single Sign On with automatic user provisioning (SAML)
6. 2018: Clinical Trial Registry Tracker
Australia / New Zealand
China
Republic of Korea
USA
Germany
European Union
Hong Kong
International (Springer)
Japan
Philippines
University Hospital Medical Information Network
WHO
7. 2018: Machine learning based classification of company websites
1. To obtain a reasonably sized input vector (remember, we classify a whole website,
which may have several hundred MB of content), we convert the data into a vector
using a TF-IDF pre-processor trained on a corpus collected for the project.
2. A Support Vector Machine alone is not good enough; therefore, pre-processing
all input with a custom thesaurus is necessary.
3. For all 6 real-world samples, we got an average recognition rate above 96%.
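The pre-processing described above can be illustrated with a minimal, standard-library-only sketch: tokens are first mapped through a thesaurus, then the text is converted into a sparse TF-IDF vector using statistics learned from a training corpus. The thesaurus entries, the sample corpus, the function names, and the smoothed IDF formula are all assumptions chosen for illustration; the slides do not describe the actual implementation. The classifier itself (for example, a Support Vector Machine trained on these vectors) is not shown.

```python
import math
import re
from collections import Counter

# Hypothetical thesaurus: maps surface terms to a preferred concept label.
THESAURUS = {
    "vaccines": "vaccine",
    "immunization": "vaccine",
    "veterinary": "animal_health",
    "livestock": "animal_health",
}


def normalize(text):
    """Lowercase, tokenize, and map tokens to thesaurus concepts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [THESAURUS.get(tok, tok) for tok in tokens]


def fit_idf(corpus):
    """Learn smoothed inverse document frequencies from a training corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(normalize(doc)))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(text, idf):
    """Turn one website's text into an L2-normalized sparse TF-IDF vector."""
    counts = Counter(tok for tok in normalize(text) if tok in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


# Illustrative training corpus and input text (not from the project).
corpus = [
    "Vaccines for livestock",
    "Immunization programs for pets",
    "Cloud hosting services",
]
idf = fit_idf(corpus)
vector = tfidf_vector("Veterinary vaccines and immunization", idf)
print(vector)
```

Mapping synonyms to one concept before vectorizing means that "vaccines" and "immunization" contribute to the same feature, which is one plausible reason thesaurus pre-processing helps a downstream classifier.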
8. 2019: Globally distributed massive parallel crawling
Transferring 1 page from London: 82 ms; 500 pages: 41 seconds; 1,000 servers: 11.5 hours
Transferring 1 page from Tokyo: 1,200 ms; 500 pages: 10 minutes; 1,000 servers: 6 days 23 hours
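The transfer times above follow from simple multiplication, which the sketch below reproduces. It assumes pages are fetched strictly one after another, that per-page latency is constant, and that parallel workers split the work evenly with no overhead; the function names and the worker count are illustrative. The exact products are 41,000 seconds (about 11.4 hours) for London and 600,000 seconds (just under 6 days 23 hours) for Tokyo, so the slide's figures appear to be rounded.

```python
def sequential_crawl_seconds(page_latency_ms, pages_per_server, servers):
    """Total time if every page is fetched one after another."""
    return page_latency_ms * pages_per_server * servers / 1000


def parallel_crawl_seconds(total_seconds, workers):
    """Idealized time when the work is split evenly across independent workers."""
    return total_seconds / workers


def format_duration(seconds):
    """Render a number of seconds as days, hours, minutes, and seconds."""
    total = int(round(seconds))
    days, rest = divmod(total, 86_400)
    hours, rest = divmod(rest, 3_600)
    minutes, secs = divmod(rest, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


# Latencies and counts are the figures quoted on the slide.
london = sequential_crawl_seconds(82, 500, 1_000)
tokyo = sequential_crawl_seconds(1_200, 500, 1_000)

print("London:", format_duration(london))
print("Tokyo: ", format_duration(tokyo))
# Hypothetical: 20 parallel workers crawling the high-latency servers.
print("Tokyo, 20 workers:", format_duration(parallel_crawl_seconds(tokyo, 20)))
```

Under these assumptions there are two levers: reduce per-page latency by crawling from a location close to the target servers, and divide the remaining time by running many crawlers in parallel.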
Image credit: NASA’s Terra satellite for the MODIS imageries, combined by Meow. Credit: NASA Goddard Space Flight Center. Image by Reto Stöckli (land surface, shallow water, clouds). Enhancements by Robert Simmon (ocean color, compositing, 3D globes, animation). Data and technical support: MODIS Land Group; MODIS Science Data Support Team; MODIS Atmosphere Group; MODIS Ocean Group. Additional data: USGS EROS Data Center (topography); USGS Terrestrial Remote Sensing Flagstaff Field Center (Antarctica); Defense Meteorological Satellite Program (city lights). Public domain, via Wikimedia Commons.
9. 2020/21: Deep learning for automated news rating
[Architecture diagram: automated news rating. Labels recovered from the slide:]
• Sources: corporate websites, news portals, news feeds, 3rd party APIs, licensed 3rd party content
• Ingestion: crawlers, feed readers, {APIs}; crawl / retrieve news; extracted news
• Storage: news archive of (un)rated news
• Model lifecycle: optimization of models, retraining; model rating; publish model + metadata; deploy and verify model
• Rating: request news rating for selected model; deploy selected model; rate news with selected model; return news rated with selected model
• Consumption: consume via {APIs} / { standard API }
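One way to picture the flow in the diagram above (publish a model with its metadata, then request ratings from a selected model) is the small in-memory sketch below. Everything in it is hypothetical: the class and method names, the metadata fields, and the keyword-based toy model that stands in for a trained deep learning model. The slides do not describe the real interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatedNews:
    text: str
    model_name: str
    rating: float


class ModelRegistry:
    """Hypothetical registry holding published models and their metadata."""

    def __init__(self):
        self._models = {}
        self._metadata = {}

    def publish(self, name, model, metadata):
        """Publish a model (any callable text -> float) with its metadata."""
        self._models[name] = model
        self._metadata[name] = dict(metadata)

    def metadata(self, name):
        """Return a copy of the metadata stored for a published model."""
        return dict(self._metadata[name])

    def rate(self, name, news_items):
        """Rate each news item with the selected model; KeyError if unknown."""
        model = self._models[name]
        return [RatedNews(text, name, model(text)) for text in news_items]


def toy_model(text):
    """Stand-in for a trained model: flags items that mention 'approval'."""
    return 1.0 if "approval" in text.lower() else 0.0


registry = ModelRegistry()
registry.publish("toy-v1", toy_model, {"version": 1, "task": "news rating"})

unrated = ["Regulator grants approval for new product", "Weather update"]
for item in registry.rate("toy-v1", unrated):
    print(item.model_name, item.rating, item.text)
```

Keeping the model behind a name lookup means consumers can request ratings from a specific model while retrained versions are published alongside it.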
10. 2022: Automating regulatory intelligence collection and classification
(will be integrated with intranet applications to manage regulatory events)
Klaus Kater
kkater@copyright.com

AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of experience. Klaus Kater (Copyright Clearance Center, DE)

  • 1. AI-SDV 2022, Oct. 10/11 2022 Klaus Kater Director, Research & Development
  • 2. 1. 2012-2013: Crawling metasearch results Yahoo.com / Bing.com 2
  • 3. 1. 2012-2013: Crawling metasearch results Yahoo.com / Bing.com 2. 2014: Prototyping a configurable crawler framework 3
  • 4. 4 Look and feel 2014 Look and feel 2022
  • 5. 1. 2012-2013: Crawling metasearch results Yahoo.com / Bing.com 2. 2014: Prototyping a configurable crawler framework 3. 2015: Rollout of Company SEARCHCORPUS 5
  • 6. 6 Having started in 2015 with just about 30,000 companies, the company SEARCHCORPUS keeps growing and growing… 2015 2016 2017 2018 2019 • 30.000 company websites • Duration 2 weeks: • Crawling • Indexing • 50 GB of web data • 60.000 company websites • Still 2 weeks: • Crawling • Indexing • 500 GB of web data • 290.000 company websites • Link depth 5 • Still 2 weeks: • Crawling • Geolocation • Classification • Indexing • 2 TB of web data 2020 2021 2022
  • 7. 4. 2016: Thesaurus management for domain specific content selection
  • 9. 5. 2017: Establishing a process to roll out news trackers and other crawling solutions
  • 10. Timeline 2015 / 2016 / 2017:
        • Preparation of international rollout of domain specific targeted news trackers and alerting
          • Animal Health Tracker
          • RBB Tracker
          • CRDI Tracker
          • BD&L Tracker
        • Single Sign On with automatic user provisioning (SAML)
  • 11. 6. 2018: Clinical Trial Registry Tracker
  • 12. Australia / New Zealand; China; Republic of Korea; USA; Germany; European Union; Hong Kong; International (Springer); Japan; Philippines; University Hospital Medical Information Network; WHO
  • 13. 7. 2018: Machine learning based classification of company websites
  • 14. 1. To obtain a reasonably sized input vector (remember, we classify a whole website, which may have several hundred MB of content), we convert the data into a vector using a TF-IDF pre-processor trained on a corpus collected for the project.
        2. Support Vector Machines alone are not good enough; pre-processing all input with a custom thesaurus is therefore necessary.
        3. For all 6 real-world samples we achieved an average recognition rate above 96%.
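The pre-processing described on this slide can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the implementation used in the project: the thesaurus entries, the sample corpus, and the function names `tokenize`, `fit_idf`, and `tfidf_vector` are all assumptions made for the example. The slide does not say which TF-IDF variant was used; a smoothed IDF with L2 normalisation is assumed here.

```python
import math
import re
from collections import Counter

# Hypothetical thesaurus (illustrative only): maps synonyms to one preferred term.
THESAURUS = {"drug": "pharmaceutical", "medicine": "pharmaceutical", "vet": "veterinary"}

def tokenize(text: str) -> list[str]:
    """Lowercase, split into words, and replace synonyms with preferred terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return [THESAURUS.get(w, w) for w in words]

def fit_idf(corpus: list[str]) -> dict[str, float]:
    """Learn smoothed inverse document frequencies from a training corpus."""
    n_docs = len(corpus)
    doc_freq: Counter = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

def tfidf_vector(text: str, idf: dict[str, float]) -> dict[str, float]:
    """Turn one website's text into a sparse, L2-normalised TF-IDF vector."""
    counts = Counter(t for t in tokenize(text) if t in idf)  # drop unknown terms
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}

# Made-up training corpus standing in for the project corpus.
corpus = [
    "We develop medicine for animals and vet clinics",
    "Our drug pipeline targets respiratory disease",
    "We build software for logistics companies",
]
idf = fit_idf(corpus)
vec = tfidf_vector("A new drug and medicine portfolio", idf)
print(vec)
```

Because "drug" and "medicine" are both mapped to "pharmaceutical" before counting, they reinforce a single dimension instead of being spread over two. A vector like `vec` would then be passed to a classifier such as a Support Vector Machine.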
  • 15. 8. 2019: Globally distributed massive parallel crawling
  • 16. Transferring 1 page from London: 82 ms; 500 pages: 41 seconds; 1,000 servers: 11.5 hours
        Transferring 1 page from Tokyo: 1,200 ms; 500 pages: 10 minutes; 1,000 servers: 6 days 23 hours
        Image credit: NASA's Terra satellite for the MODIS imageries, combined by Meow. Credit: NASA Goddard Space Flight Center. Image by Reto Stöckli (land surface, shallow water, clouds). Enhancements by Robert Simmon (ocean color, compositing, 3D globes, animation). Data and technical support: MODIS Land Group; MODIS Science Data Support Team; MODIS Atmosphere Group; MODIS Ocean Group. Additional data: USGS EROS Data Center (topography); USGS Terrestrial Remote Sensing Flagstaff Field Center (Antarctica); Defense Meteorological Satellite Program (city lights). Public domain, via Wikimedia Commons
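The figures on this slide follow from simple multiplication, assuming every page is fetched one after another from a single location. The short sketch below reproduces that arithmetic; the function name `crawl_duration` is made up for the example, and only the per-page latencies and the page and server counts come from the slide.

```python
from datetime import timedelta

def crawl_duration(ms_per_page: int, pages_per_server: int, servers: int) -> timedelta:
    """Total transfer time if all pages are fetched strictly sequentially."""
    return timedelta(milliseconds=ms_per_page * pages_per_server * servers)

# One server with 500 pages.
print(crawl_duration(82, 500, 1))       # 0:00:41
print(crawl_duration(1200, 500, 1))     # 0:10:00

# 1,000 servers with 500 pages each.
london = crawl_duration(82, 500, 1000)
tokyo = crawl_duration(1200, 500, 1000)
print(london)                           # 11:23:20
print(tokyo)                            # 6 days, 22:40:00
```

The exact results (11 h 23 min and 6 days 22 h 40 min) match the rounded values on the slide. The gap between the two rows comes entirely from per-page latency.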
  • 17. 9. 2020/21: Deep learning for automated news rating
  • 18. Architecture diagram (labels only):
        • Diagram labels: Corporate websites; news portals; news feeds; crawlers; feed readers; {APIs}; 3rd party APIs; licensed 3rd party content; news archive; (un)rated news; consume {APIs}; { standard API }; model + metadata
        • Workflow labels: Crawl / retrieve news; extracted news; request news rating for selected model; rate news with selected model; return news rated with selected model; publish model + metadata; deploy selected model; optimization of models, retraining; model rating; deploy and verify model
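The publish, deploy, and rate steps named in this diagram could be modelled along the following lines. This is a purely hypothetical in-memory sketch: the class `ModelRegistry`, its methods, the model name, and the keyword-based scorer are all invented for illustration. The slide gives no API details, and the toy scorer is only a stand-in for the deep learning models it mentions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A "model" here is any function from news text to a score in [0, 1].
Model = Callable[[str], float]

@dataclass
class ModelRegistry:
    """Hypothetical registry of published models and the subset that is deployed."""
    published: Dict[str, dict] = field(default_factory=dict)
    deployed: Dict[str, Model] = field(default_factory=dict)

    def publish(self, name: str, model: Model, metadata: dict) -> None:
        """Publish model + metadata."""
        self.published[name] = {"model": model, "metadata": metadata}

    def deploy(self, name: str) -> None:
        """Deploy a previously published model so it can serve rating requests."""
        self.deployed[name] = self.published[name]["model"]

    def rate(self, name: str, news: List[str]) -> List[dict]:
        """Rate news with the selected model and return the rated items."""
        model = self.deployed[name]
        return [{"text": item, "model": name, "rating": model(item)} for item in news]

def keyword_model(text: str) -> float:
    """Toy scorer: fraction of made-up keywords that occur in the text."""
    keywords = ("trial", "approval", "vaccine")
    return sum(1 for k in keywords if k in text.lower()) / len(keywords)

registry = ModelRegistry()
registry.publish("example-model-v1", keyword_model, {"note": "hypothetical"})
registry.deploy("example-model-v1")
rated = registry.rate(
    "example-model-v1",
    ["New vaccine trial announced", "Quarterly results published"],
)
for item in rated:
    print(item)
```

Keeping publication and deployment as separate steps mirrors the diagram, where a model is published with its metadata first and deployed only once selected.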
  • 19. 1. 2012-2013: Crawling metasearch results Yahoo.com / Bing.com
        2. 2014: Prototyping a configurable crawler framework
        3. 2015: Rollout of Company SEARCHCORPUS
        4. 2016: Thesaurus management for domain specific content selection
        5. 2017: Establishing a process to roll out news trackers and other crawling solutions
        6. 2018: Clinical Trial Registry Tracker
        7. 2018: Machine learning based classification of company websites
        8. 2019: Globally distributed massive parallel crawling
        9. 2020/21: Deep learning for automated news rating
        10. 2022: Automating regulatory intelligence collection and classification (will be integrated with intranet applications to manage regulatory events)