2. About the speaker
Background in Management Engineering @
Politecnico of Milan
Database Architect @ KITeS (previously CESPRI)
since 2002
Project manager for data production in EU Projects
STI-NET, TENIA, AEGIS and EU Tenders ICT
network impact, INNOVA, Highly Cited Patents,
Measurement and analysis of knowledge and R&D
exploitation flows, assessed by patent and licensing
data
Collaborations on database projects with: MIT, LSE,
Danish Board of Technology, Bonn Graduate School
of Economics, Universität Mainz, BETA …
Editor of the blog rawpatentdata.blogspot.com
3. What is PATSTAT
PATSTAT is a snapshot of the EPO database
covering about 70 million applications from
more than 80 application authorities,
containing bibliographic data, citations
and family links. The data must be
loaded into the customer's own
database.
+ low cost of ownership
- costs of implementation
4. Data Sources for PATSTAT
Source for EP data is DOCDB (EPO
master documentation database)
Sources for other offices are files
provided by the respective patent authorities
+ Good coverage for US, EU states, JP,
EPO, WIPO
- For other authorities, gaps and missing data
are not easy to identify
5. Implementing the DB (I)
Over 20 tables in
a relational DB
with application ID
as the main primary
key
EPO adds /
improves data
with each edition
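A minimal sketch of how tables hang together around the application ID, using an in-memory SQLite database. The table and column names (tls201_appln, tls206_person, tls207_pers_appln, appln_id, person_id, applt_seq_nr) follow PATSTAT naming, but the schema is cut down to a few columns and all rows are invented for illustration; a real installation has many more tables and fields.

```python
import sqlite3

# Toy, reduced schema: an assumption for illustration, not the full PATSTAT layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tls201_appln (
        appln_id   INTEGER PRIMARY KEY,
        appln_auth TEXT
    );
    CREATE TABLE tls206_person (
        person_id   INTEGER PRIMARY KEY,
        person_name TEXT
    );
    CREATE TABLE tls207_pers_appln (
        person_id    INTEGER,
        appln_id     INTEGER,
        applt_seq_nr INTEGER,
        PRIMARY KEY (person_id, appln_id)
    );
    -- Invented sample rows.
    INSERT INTO tls201_appln VALUES (1, 'EP'), (2, 'US');
    INSERT INTO tls206_person VALUES (10, 'ACME SPA'), (11, 'ROSSI, MARIO');
    INSERT INTO tls207_pers_appln VALUES (10, 1, 1), (11, 1, 0), (10, 2, 1);
""")


def applicants_by_authority(connection):
    """List (authority, applicant name) pairs by joining on appln_id."""
    query = """
        SELECT a.appln_auth, p.person_name
        FROM tls201_appln a
        JOIN tls207_pers_appln pa ON pa.appln_id = a.appln_id
        JOIN tls206_person p ON p.person_id = pa.person_id
        WHERE pa.applt_seq_nr > 0      -- keep applicants, skip pure inventors
        ORDER BY a.appln_id
    """
    return connection.execute(query).fetchall()


rows = applicants_by_authority(conn)
```

Every other table joins back to the application table in the same way, which is why the application ID works as the backbone of the database.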
6. Implementing the DB (II)
+ standard scripts, a growing community
to exchange procedures etc. (example)
- need a person who has both DB and
patent data knowledge
7. Plug & play extensions
Datasets that can be added with no effort:
Regpat: OECD dataset giving NUTS3 for each
applicant / inventor (EP only)
Han: OECD Harmonised Applicant Names
dataset (EP only)
eee_ppat: KUL/Eurostat standard names and
sector allocation (all PATSTAT)
Tls221: EPO legal data table, which makes it possible to include
changes of ownership, oppositions... (example)
ape-inv: inventor disambiguation tools and
academic inventors.
Note: all tables except TLS221 are free of charge
8. Some papers using Kites-Patstat
DB
Lissoni, F., Llerena, P., McKelvey, M., and B. Sanditov "Academic Patenting in Europe: New Evidence from the KEINS Database," Research
Evaluation, 17(2): 87-102.
Bacchiocchi E., Montobbio F. (2009); Knowledge Diffusion from University and Public Research. A Comparison between US Japan and
Europe using Patent Citations. Journal of Technology Transfer, vol.34 (2), pp.169-181.
Breschi S., Lissoni F., Montobbio F. (2008). University patenting and scientific productivity. A quantitative study of Italian academic inventors.
European Management Review. The Journal of the European Academy of Management 5(2): 91-109
Corrocher N., Malerba F., Montobbio F. (2007); Schumpeterian Patterns of Innovative Activity in the ICT Field. Research Policy. vol. 36, pp.
418-432
Breschi S., Lissoni F., Montobbio F. (2007). The Scientific Productivity Of Academic Inventors: New Evidence From Italian Data. Economics
of Innovation and New Technology, Vol. 16, Issue 2, pp. 101-118
Della Malva A, Breschi S, Lissoni F, Montobbio F. (2007). L'attivita' brevettuale dei docenti universitari: L'Italia in un confronto internazionale.
Economia e Politica Industriale, v.2, pp.43-70.
Montobbio F. (2008); Patenting Activity in Latin American and Caribbean Countries. In World Intellectual Property Organization (WIPO) -
Economic Commission for Latin America and the Caribbean (ECLAC) - "Study on Intellectual Property Management in Open
Economies: A Strategic Vision for Latin America". Forthcoming
Frazzoni S., Mancusi M., Rotondi Z., Sobrero M., Vezzulli A., (2011), “Relationship with banks and access to credit for innovation and
internationalization in SMEs”, L’EUROPA E OLTRE. Banche e imprese nella nuova globalizzazione, XVI Rapporto sul sistema
finanziario italiano, Edibank, 2011. ISBN 978-88-449-0495-1.
V. Sterzi: Patent quality and ownership: An analysis of UK faculty patenting, Research Policy, 2012 (forthcoming)
9. Some advanced
applications
OST patent applicants data quality
procedure and Match with ORBIS
OST common identifier among Patstat
WoS, Framework programs DBs
10. Applicants data quality
procedure and Match with
ORBIS (I)
The goal of the procedure is to clean and
standardize (C&S) patent applicant names (e.g.
removing the type of company, common
misspellings etc.)
After the names C&S, a procedure has been
developed that applies 5 different
match algorithms in order to obtain
the best matches with ORBIS company
names.
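The two steps above can be sketched roughly as follows. The slides do not name the 5 match algorithms, so this example stands in with just two generic ones (exact match on the cleaned name, then a string-similarity score from Python's difflib); the legal-form list, the function names and the 0.9 threshold are all assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, very short list of legal-form tokens; a real list is far longer.
LEGAL_FORMS = {"SPA", "SRL", "GMBH", "AG", "LTD", "INC", "CORP", "SA", "CO"}


def clean_name(raw):
    """Uppercase, strip punctuation and drop legal-form tokens."""
    name = raw.upper().replace(".", "")   # "S.p.A." -> "SPA"
    name = re.sub(r"[^\w ]", " ", name)   # other punctuation -> space
    tokens = [t for t in name.split() if t not in LEGAL_FORMS]
    return " ".join(tokens)


def best_match(applicant, companies, threshold=0.9):
    """Return (company, score) for the best candidate, or None if none qualifies."""
    cleaned = clean_name(applicant)
    best = None
    for company in companies:
        candidate = clean_name(company)
        if candidate == cleaned:          # algorithm 1: exact match after C&S
            return company, 1.0
        # algorithm 2: similarity ratio between the two cleaned strings
        score = SequenceMatcher(None, cleaned, candidate).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (company, score)
    return best


exact = best_match("Fiat Auto S.p.A.", ["FIAT AUTO SPA", "Siemens AG"])
fuzzy = best_match("Simens AG", ["FIAT AUTO SPA", "Siemens AG"])
missing = best_match("Unknown Ltd", ["Siemens AG"])
```

Running the cheap exact match first and falling back to a scored match is one common way to combine several algorithms; the threshold trades precision against recall.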
11. Applicants data quality
procedure and Match with
ORBIS (II)
Data quality procedure developed using
portable queries and tables (see Tarasconi -
Sharing names/address cleaning patterns for Patstat,
from PATSTAT users day 2011)
Match procedure developed aiming to
be multipurpose (e.g. it has already been used to match
TM vs patent applicants @ KITeS)
Code and tables available for MySQL and
Oracle.
http://documents.epo.org/projects/babylon/eponet.nsf/0/92ab5eb34ff406d1c125795d0050bbcc/$FILE/PATSTAT_user_day_2011_presentations.zip
12. Applicants data quality
procedure and Match with
ORBIS (III)
C&S step results: from 12,280,000 patent
applicants to about 3,800,000 companies
Match against: 353,294 ORBIS companies
in NACE 2540, 2630, 2651, 2910, 3030,
3011, 8422 (defense)
Results: 94,726 patent applicants matched against
66,256 ORBIS companies
Benchmark: validation against a 1% sample
returned a precision of 91%
and a recall of 95%
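For reference, precision is the share of proposed matches that are correct, and recall is the share of true matches that were found. A minimal sketch of how both could be computed on a validation sample; the pair identifiers are invented and are not the figures reported on the slide.

```python
def precision_recall(found, gold):
    """Compare proposed (applicant, company) pairs with a hand-checked gold set."""
    found, gold = set(found), set(gold)
    true_pos = len(found & gold)                        # pairs both proposed and correct
    precision = true_pos / len(found) if found else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Invented validation sample: 4 true pairs, 3 proposed, 2 of them correct.
gold = {("A1", "C1"), ("A2", "C2"), ("A3", "C3"), ("A4", "C4")}
found = {("A1", "C1"), ("A2", "C2"), ("A3", "C9")}
precision, recall = precision_recall(found, gold)
```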
13. OST Common identifier (I)
Data categories existing across patent,
scientific publications and Framework
programs data:
                     PATSTAT                         FPs                     WOS
Geographic data      inventors/applicants addresses  participants addresses  affiliations addresses
Individuals          inventors, applicants           contacts                authors
Companies            applicants                      participants            affiliations
Sci/tech taxonomies  IPC                             TPs                     subject categories
14. OST Common identifier (II)
1) DEFINE ATOMIC ENTITIES AND NON-AMBIGUOUS
JOINS
Even when they cover similar entities, the
datasets differ in the
granularity of their data.
(e.g. in WOS affiliations may be by lab / dept while
patents may be by IP office: different size)
The bridge dataset should use an entity size
that allows unique data matches across the different
sets. This might also require some changes in
existing databases.
The bridge dataset should also make possible a
hierarchical structure of entities, allowing
joins to the main datasets at different levels.
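A rough sketch of the hierarchy idea: each entity in the bridge dataset points to its parent, so a fine-grained record (for instance a lab-level affiliation) can be rolled up to the level used by another dataset before joining. The parent table and the "Lab A" entry are hypothetical, and the hierarchy is assumed to contain no cycles.

```python
# Hypothetical bridge table: entity -> parent entity (None for the top level).
PARENT = {
    "CNRS": None,
    "CNRS Bordeaux": "CNRS",
    "CNRS Bordeaux Lab A": "CNRS Bordeaux",
}


def ancestors(entity, parent=PARENT):
    """Return the chain from the entity up to the top of the hierarchy."""
    chain = [entity]
    while parent.get(chain[-1]) is not None:
        chain.append(parent[chain[-1]])
    return chain


def roll_up(entity, level, parent=PARENT):
    """Return the ancestor at depth `level` (0 = top), or the entity itself if shallower."""
    chain = ancestors(entity, parent)[::-1]   # top-level entity first
    return chain[min(level, len(chain) - 1)]


top = roll_up("CNRS Bordeaux Lab A", 0)
mid = roll_up("CNRS Bordeaux Lab A", 1)
```

Joining two datasets then means rolling both sides up to the deepest level they have in common.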
16. OST Common identifier (IV)
2) TIME SERIES
2a) DATASET ASYNCHRONIES
Data may enter the database on different time frames
depending on the dataset.
(e.g. PATSTAT is a full update, so a snapshot at the moment of
data creation, while WOS is an incremental update; so name
changes/M&A could make the same entity look different in the 2 datasets;
note also that geographic entities change with time: counties,
countries…)
Bridge tables must have a time-related dimension.
2b) DATA TRANSFORMATIONS
Data change over time.
(e.g. companies may merge, split [most critical case], change
name, change owner…)
Bridge tables must have a continuation dimension
that allows the transformation of entities to be followed.
17. OST Common identifier (V)
Time series example
Sarajevo changes from YU to BS in 1992
BEFORE  Sarajevo  YU  BS
AFTER   Sarajevo  YU  1800  1991
        Sarajevo  BS  1992  9999
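The "AFTER" layout can be queried by year, as in this small sketch. The rows are copied from the slide as written; the column names (city, country, date_from, date_to) and the reading of 1800 and 9999 as open-ended sentinels are assumptions, since the slide shows no headers.

```python
# (city, country, date_from, date_to): values as they appear on the slide.
CITY_COUNTRY = [
    ("Sarajevo", "YU", 1800, 1991),
    ("Sarajevo", "BS", 1992, 9999),
]


def country_in_year(city, year, rows=CITY_COUNTRY):
    """Return the country code valid for a city in a given year, or None."""
    for name, country, date_from, date_to in rows:
        if name == city and date_from <= year <= date_to:
            return country
    return None


before = country_in_year("Sarajevo", 1985)
after = country_in_year("Sarajevo", 1992)
```

With validity intervals, a 1985 patent address and a 2005 publication address for the same city each resolve to the country that was valid at the time.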
18. OST Common identifier (VI)
OBJECT / PROPERTIES DATA STRUCTURE
The proposed data structure should be a TEMPORAL DATABASE (1), able to store
PROPERTIES/STATUS/EVENTS, so it would, for instance, contain the following fields:
PROPERTYNAME (e.g. ownership, affiliation…)
PROPERTYVALUE (e.g. new owner, new affiliation)
DATEFROM
DATETO
CHGREASON (if blank, the record is still valid)
VALUE1…N (e.g. type of acquisition, % ownership…)
Along with the properties, it must also be defined how properties are inherited among entities
(e.g. CNRS Bordeaux inherits ownership, and probably sector of activity, from CNRS…)
(1) See Richard T. Snodgrass. "TSQL2 Temporal Query Language". www.cs.arizona.edu. Computer
Science Department of the University of Arizona
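A minimal sketch of such a property record and of inheritance between entities, in plain Python rather than a temporal SQL dialect. The field names mirror the list above; the class, the lookup function, the parent table and all sample values and years are hypothetical and only illustrate the mechanism.

```python
from dataclasses import dataclass

OPEN_END = 9999  # sentinel for "still valid", as used in the slide examples


@dataclass(frozen=True)
class PropertyRecord:
    entity: str
    property_name: str
    property_value: str
    date_from: int
    date_to: int = OPEN_END
    chg_reason: str = ""     # blank while the record is still valid


# Hypothetical sample data: values and years are illustrative only.
RECORDS = [
    PropertyRecord("CNRS", "SECTOR", "PUBLIC RESEARCH", 1950),
    PropertyRecord("CNRS Bordeaux", "OWNERSHIP", "CNRS", 1990),
]
PARENT = {"CNRS Bordeaux": "CNRS", "CNRS": None}


def lookup(entity, property_name, year, records=RECORDS, parent=PARENT):
    """Value of a property in a given year, inherited from ancestors when unset."""
    current = entity
    while current is not None:
        for rec in records:
            if (rec.entity == current
                    and rec.property_name == property_name
                    and rec.date_from <= year <= rec.date_to):
                return rec.property_value
        current = parent.get(current)   # not found here: try the parent entity
    return None


sector = lookup("CNRS Bordeaux", "SECTOR", 2000)
owner = lookup("CNRS Bordeaux", "OWNERSHIP", 2000)
```

Here the sector is not stored for CNRS Bordeaux, so it is inherited from CNRS, while ownership is stored directly on the child entity.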
19. APPENDIX: Temporal database Example (I)
NOVARTIS
Novartis Pharma originated from the merger of CIBA
(1884), GEIGY (1758) and Sandoz (1876)
Until 1970 they are 3 separate entities
LEGPCODE  LEGPNAME
1         CIBA
2         GEIGY
3         SANDOZ
4         CIBA SUB 1…N
5         GEIGY SUB 1…N
6         SANDOZ SUB 1…N
LEGPCODE  PROPNAME   PROPVALUE  STATUSCODE2  STATUSTEXT  STATUSPERC  DATEFROM  DATETO  CHGREASON
1         OWNERSHIP  FULLOWN    1                        100         1884      9999
2         OWNERSHIP  FULLOWN    2                        100         1758      9999
3         OWNERSHIP  FULLOWN    3                        100         1876      9999
4         OWNERSHIP  FULLOWN    1                        100         1884      9999
5         OWNERSHIP  FULLOWN    2                        100         1758      9999
6         OWNERSHIP  FULLOWN    3                        100         1876      9999