2. Aims
Maria Francesca Romano
Institute of Economics & EMbeDS - Scuola Superiore Sant’Anna
Starting point
o Research directed by Guido M. Rey resulted in the volume «La
mafia come impresa. Analisi del sistema economico criminale e delle
politiche di contrasto» (2017).
o The chapter «Dalle parole ai numeri: estrarre dati dalle sentenze della
magistratura» presents the results of the analysis of about 5,000
judgments issued by the Corte di Cassazione.
Goals
o Enrich the results obtained from text mining of the judgments by
combining multiple data sources.
o Evaluate the completeness and reliability of the data.
o Organize database(s) suitable for estimating statistical models.
3. Exercise: Integration of data from multiple sources
① Judgments issued by the Corte di Cassazione (www.italgiure.it): Open Data PA
② Orbis: database of companies, accessible through the resources of EMbeDS
(Economics and Management in the era of Data Science), a project selected in the MIUR
call for Departments of Excellence 2018-2022, http://embeds.santannapisa.it/
A subset of 308 judgments was extracted from the 4,632 selected judgments (from 2012
to September 2016) that contain one or more of the words “corruzione” (corruption),
“concussione” (extortion by a public official), “turbativa” (interference with tenders) and
“appalto” (procurement contract). The subset consists of judgments:
• Issued in 2014
• Containing references to professional roles held in the Public Administration
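The selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Judgment` record, its field names and the `PA_ROLES` list are hypothetical (the slides do not give the list of roles actually used), and whole-word matching is just one possible matching rule.

```python
import re
from dataclasses import dataclass

# Search terms quoted on the slide.
KEYWORDS = ("corruzione", "concussione", "turbativa", "appalto")

# Hypothetical examples of Public Administration roles; the list actually
# used in the study is not given in the slides.
PA_ROLES = ("sindaco", "assessore", "dirigente comunale")


@dataclass
class Judgment:
    judgment_id: str  # hypothetical identifier
    year: int         # year of issue
    text: str         # full text of the judgment


def contains_any(text: str, terms: tuple[str, ...]) -> bool:
    """True if at least one term occurs as a whole word or phrase (case-insensitive)."""
    lowered = text.lower()
    return any(re.search(r"\b" + re.escape(term) + r"\b", lowered) for term in terms)


def select_subset(judgments: list[Judgment], year: int = 2014) -> list[Judgment]:
    """Keep judgments issued in `year` that mention a keyword and a PA role."""
    return [
        j for j in judgments
        if j.year == year
        and contains_any(j.text, KEYWORDS)
        and contains_any(j.text, PA_ROLES)
    ]


# Invented toy texts, for illustration only.
sample = [
    Judgment("A", 2014, "Il sindaco era imputato di corruzione per un appalto."),
    Judgment("B", 2013, "Il sindaco era imputato di concussione."),
    Judgment("C", 2014, "Reato di corruzione tra privati."),
]
selected_ids = [j.judgment_id for j in select_subset(sample)]
print(selected_ids)
```

Whole-word matching would miss inflected forms (for example plurals), which is one reason a lemmatized vocabulary is useful in the next step.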
4. Step 1: Import the texts of the judgments and text mining
Through the TalTaC2 package:
o Creation of a corpus with the texts of the judgments
o Vocabulary (words and lemmas)
o Grammatical and semantic tagging
o Identification of multiwords and segments
o Text mining
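To make the vocabulary and segment steps concrete, here is a minimal sketch in plain Python. It is not TalTaC2 and does not reproduce its procedures: the tokenizer, the function names and the thresholds are assumptions chosen for illustration, and no lemmatization or tagging is attempted.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (graphic forms)."""
    return re.findall(r"[a-zàèéìòù]+", text.lower())


def vocabulary(corpus: list[str]) -> Counter:
    """Count the occurrences of each graphic form in the corpus."""
    counts: Counter = Counter()
    for document in corpus:
        counts.update(tokenize(document))
    return counts


def repeated_segments(corpus: list[str], n: int = 2, min_count: int = 2) -> dict[str, int]:
    """Return the n-word sequences that occur at least `min_count` times.

    Repeated sequences are candidates for multiwords such as "pubblico ufficiale".
    """
    counts: Counter = Counter()
    for document in corpus:
        tokens = tokenize(document)
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {segment: c for segment, c in counts.items() if c >= min_count}


# Invented toy corpus, for illustration only.
corpus = [
    "Il pubblico ufficiale ha ricevuto denaro.",
    "La condotta del pubblico ufficiale integra il reato.",
]
vocab = vocabulary(corpus)
segments = repeated_segments(corpus)
print(vocab.most_common(3))
print(segments)
```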
5. Information chart
As the following figure shows, the information is centred on the single criminal event, which
involves actors (individuals or groups), is detected / sanctioned, takes place in a specific
geographical location, on a definite date (or in a definite period), in a given manner, and
has a given economic value.
[Figure: information chart centred on the criminal event (WHAT). Node labels: person,
criminal association, public body, company (WHO); court, police; place (WHERE);
period (WHEN); economic value in euro. Relation labels: involves; is part of / works for;
together with; to the detriment of; sanctioned / detected; when; where; how (HOW).]
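One way to turn the chart into a data structure is sketched below. The class and field names are hypothetical and simply follow one reading of the labels in the figure; the `missing_fields` helper is an added illustration of how the completeness of each record could be checked.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class Actor:
    name: str
    kind: str                         # "person", "criminal association", "public body" or "company"
    belongs_to: Optional[str] = None  # organisation the actor is part of / works for


@dataclass
class CriminalEvent:
    what: str                                       # type of criminal event
    who: list[Actor] = field(default_factory=list)  # actors involved
    damaged_party: Optional[str] = None             # "to the detriment of"
    detected_by: Optional[str] = None               # e.g. police
    sanctioned_by: Optional[str] = None             # e.g. court
    where: Optional[str] = None                     # place
    when: Optional[str] = None                      # date or period
    how: Optional[str] = None                       # manner
    economic_value_eur: Optional[float] = None      # economic value in euro

    def missing_fields(self) -> list[str]:
        """Names of the fields still empty, as a simple completeness check."""
        return [f.name for f in fields(self) if getattr(self, f.name) in (None, [], "")]


# Invented placeholder values, for illustration only.
event = CriminalEvent(
    what="corruzione",
    who=[Actor("Nome1", "person", belongs_to="Ente Pubblico")],
    where="Calabria",
    when="2014",
)
missing = event.missing_fields()
print(missing)
```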
6. Guidelines followed for matching with Orbis
-- The matching procedure must be automatic or automatable: repeatable on lists
obtained from a larger number of judgments, with no "manual" choices.
-- The presence in clear text of data / information on natural persons does not pose privacy
problems: this information is not extracted "per se" but is the prerequisite for
obtaining a correct and reliable match; the data are still processed statistically
(anonymously).
7. Step 2: Matching with Orbis (1)
«Batch search» (automatic) in two consecutive steps:
Companies: list obtained from TalTaC2 by exporting the company name and the
identifier of the judgment
Persons (defendants): list of defendants obtained from TalTaC2 by exporting the
graphic forms semantically tagged as «defendants» (multiword graphic forms
with name and surname, or surname and name) together with the date of birth
8. Step 2: Matching with Orbis (2)
RESULTS of the «Batch search» (automatic) in two consecutive steps:
Company input: 400 companies, of which 228 with an A score
186 unique companies
(duplicates arise because the same company name appears in several judgments
or is written by the judges in several variants)
Person input (defendants): 408 defendants (unique, no repetitions)
16 validated records (automatic comparison between the date of birth and part of
the social security number) + 6 individual companies
The automated process produces a matching score for each record.
Our quality indicator uses the following scoring criteria:
A Excellent: total score >= 95%
B Good: total score between 85 and 94%
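The two automatic checks above can be sketched as follows. The function names are hypothetical. The grading function treats any score from 85 up to (but not including) 95 as B, which is an assumption about how non-integer scores are handled. The validation function assumes that the "social security number" is the Italian codice fiscale, whose characters 7 to 11 encode year, month and day of birth (with 40 added to the day for women); this is a sketch of that comparison, not the procedure actually used, and it ignores the letter substitutions used for duplicate codes.

```python
from datetime import date
from typing import Optional

# Month letters used in the Italian codice fiscale (January to December).
MONTH_CODES = "ABCDEHLMPRST"


def grade(total_score: float) -> Optional[str]:
    """Map a matching score (in percent) to the quality grades on the slide."""
    if total_score >= 95:
        return "A"  # Excellent
    if total_score >= 85:
        return "B"  # Good (assumption: scores between 94 and 95 also count as B)
    return None     # below the thresholds listed on the slide


def birth_date_matches_tax_code(birth: date, tax_code: str) -> bool:
    """Compare a date of birth with the birth-date part of a codice fiscale."""
    code = tax_code.strip().upper()
    if len(code) != 16:
        return False
    year, month, day = code[6:8], code[8], code[9:11]
    if not (year.isdecimal() and day.isdecimal() and month in MONTH_CODES):
        return False
    day_number = int(day)
    if day_number > 40:  # women: day of birth + 40
        day_number -= 40
    return (
        int(year) == birth.year % 100
        and MONTH_CODES.index(month) + 1 == birth.month
        and day_number == birth.day
    )


# Invented example code, for illustration only.
print(grade(96.0), grade(90.0), grade(70.0))
print(birth_date_matches_tax_code(date(1985, 12, 10), "RSSMRA85T10A562S"))
```

Because only the last two digits of the year are encoded, a match on this part alone cannot distinguish people born a century apart.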
9. Step 3: Information contributed by Orbis: variables with high information potential
What data do we add to those already available?
Company status
Business size
Statistical classification of activities
Start year
Balance sheet data
….
BUT ALSO THE NAMES OF THE TOP MANAGEMENT AND OWNERS
Still with a view to anonymous processing, these names can be used to identify networks of
companies.
They are not of interest "per se" (we are not a detective agency) but as holders of other
individual and / or family companies (founded after the outcome of the judgment).
NB: the names of the defendants appear in clear text in the Corte di Cassazione source, as it
is the court of last instance.
10. TalTaC results: The automatic classification of judgments
11. How to interpret clusters
First 11 words characterizing the 2 main identified clusters
Cluster 1 (n=119):             Cluster 2 (n=177):
presence of organized crime    extortion by public officials (concussione) / corruption in the PA
cosca                          pubblico ufficiale
associazione mafiosa           concussione
associazione                   privato
Nome1                          costrizione
sodalizio                      corruzione
partecipazione                 induzione
conversazione                  servizio
estorsione                     CP
ndrangheta                     ufficio
clan                           abuso
Nome2                          prescrizione
12. Not just text mining but help in the interpretation
The novelty of the approach presented here is the interaction between the results of the
textual analysis and the new information that can be acquired from other databases
(administrative or not).
The questions we would like to answer:
Do the companies that appear in the judgments have different characteristics from those
that do not?
Do the companies that appear in the judgments differ according to the cluster they belong to?
Example: do they differ by company size, economic sector, geographical location?
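A first descriptive answer to the second question is a cross-tabulation of a company attribute by cluster. The sketch below uses only the standard library; the record layout, the field names and the toy records are invented for illustration and are not data from the study.

```python
from collections import Counter, defaultdict


def crosstab(records: list[dict], row_key: str, col_key: str) -> dict:
    """Count records for each (row value, column value) pair."""
    table: dict = defaultdict(Counter)
    for record in records:
        table[record[row_key]][record[col_key]] += 1
    return {row: dict(cols) for row, cols in table.items()}


def shares_within_cluster(records: list[dict], attribute: str,
                          cluster_key: str = "cluster") -> dict:
    """Share of each attribute value among the companies of each cluster."""
    totals = Counter(record[cluster_key] for record in records)
    table = crosstab(records, cluster_key, attribute)
    return {
        cluster: {value: n / totals[cluster] for value, n in cols.items()}
        for cluster, cols in table.items()
    }


# Invented toy records, for illustration only.
companies = [
    {"company": "c1", "cluster": 1, "size": "small"},
    {"company": "c2", "cluster": 1, "size": "small"},
    {"company": "c3", "cluster": 2, "size": "large"},
    {"company": "c4", "cluster": 2, "size": "small"},
]
shares = shares_within_cluster(companies, "size")
print(shares)
```

The same function can be called with "sector" or "region" as the attribute once those variables have been added from Orbis.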
13. Regions and companies by cluster
                  Cluster 1                 Cluster 2                 Total
                  (offences + org. crime)   (offences and PA)
Region            # judgments  # companies  # judgments  # companies  # judgments  # companies
Abruzzo                     -            -            1            1            1            1
Calabria                   11           33            1            1           12           34
Campania                    6           21            8           13           14           34
Emilia-Romagna              -            -            2            7            2            7
Lazio                       1            1            5            6            6            7
Liguria                     1            1            1            2            2            3
Lombardia                   1           13            6           17            7           30
Marche                      -            -            3            8            3            8
Molise                      -            -            1            1            1            1
Piemonte                    -            -            1           10            1           10
Puglia                      -            -            5           14            5           14
Sardegna                    -            -            1            1            1            1
Sicilia                     5           13            5           11           10           24
Toscana                     1            1            4           10            5           11
Veneto                      -            -            4           17            4           17
Total                      26           83           48          119           74          202
Provisional and partial data
14. Companies by national legal form
National legal form                                 Number of companies
Consortium + Consortium with external activity        4
Cooperative company (SCARL + SCARLPA)                 4
Joint stock company - SPA                            25
Limited liability company - SRL                     121
Limited partnership - SAS                             2
One-person company with limited liability - SRLU     21
One-person joint stock company - SPA                  3
Sole proprietorship                                   2
n.a.                                                  4
Total                                               186
Provisional and partial data
Still to be added: 22 one-person companies obtained from the list of defendants
15. Companies by status
Status                            Number of companies
Active                            135
Active (default of payment)         1
Bankruptcy                          1
Dissolved                           5
Dissolved (bankruptcy)             16
Dissolved (liquidation)             5
Dissolved (merger or take-over)     6
In liquidation                     11
Status unknown                      6
Total                             186
Provisional and partial data
16. Companies by geographical area and status
Area                   Active  Others  Status unknown  Total
ITC - Northwest            29      12               1     42
ITH - Northeast            22      12               -     34
ITI - Centre               33       9               -     42
ITF - South                26       8               4     38
ITG - Insular Italy        15       4               1     20
(area missing)             10       0               -     10
Total                     135      45               6    186
Provisional and partial data
Others: Active (default of payment), Bankruptcy, Dissolved, Dissolved (bankruptcy),
Dissolved (liquidation), Dissolved (merger or take-over), In liquidation
17. Discussion
The potential sources of data and information are many, and each is organized
according to its own purposes.
Using them for statistical purposes requires taking into account some aspects that are
sometimes neglected when talking about Big Data or Open Data:
• The completeness of the information
• The time reference of the information acquired or that could be acquired
18. Final goal: the «statistical» database
The database thus obtained will allow reconstructions and analyses starting from
any element (company, public body, persons, period, place, etc.), provided that it
is correctly identified as such within the texts of the judgments.
It is, therefore, necessary to use several tools:
Text mining, to process the information contained in the texts of
the judgments and transform it into data that can be analysed
statistically
Validation and integration of these data with other information and data from
other administrative databases / records.
The greater the completeness and reliability of the other databases, the greater
the information value of the analyses carried out on the statistical
database.
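A minimal sketch of such a database, using SQLite from the Python standard library, is given below. The table and column names are hypothetical and cover only judgments, companies and the link between them; persons, public bodies, places and periods could be added with the same pattern. The rows inserted are invented for illustration.

```python
import sqlite3

# Hypothetical schema: one table per element, plus a link table that
# records in which judgment a company appears and how good the match was.
SCHEMA = """
CREATE TABLE judgment (
    judgment_id TEXT PRIMARY KEY,
    year        INTEGER,
    cluster     INTEGER
);
CREATE TABLE company (
    company_id  TEXT PRIMARY KEY,
    legal_form  TEXT,
    status      TEXT,
    area        TEXT
);
CREATE TABLE judgment_company (
    judgment_id TEXT REFERENCES judgment(judgment_id),
    company_id  TEXT REFERENCES company(company_id),
    match_grade TEXT,
    PRIMARY KEY (judgment_id, company_id)
);
"""


def build_database() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


conn = build_database()
conn.executemany("INSERT INTO judgment VALUES (?, ?, ?)",
                 [("j1", 2014, 1), ("j2", 2014, 2)])
conn.executemany("INSERT INTO company VALUES (?, ?, ?, ?)",
                 [("c1", "SRL", "Active", "ITF"), ("c2", "SPA", "Dissolved", "ITC")])
conn.executemany("INSERT INTO judgment_company VALUES (?, ?, ?)",
                 [("j1", "c1", "A"), ("j2", "c1", "A"), ("j2", "c2", "B")])

# Analysis starting from one element: distinct companies per cluster.
rows = conn.execute("""
    SELECT j.cluster, COUNT(DISTINCT jc.company_id)
    FROM judgment AS j
    JOIN judgment_company AS jc USING (judgment_id)
    GROUP BY j.cluster
    ORDER BY j.cluster
""").fetchall()
print(rows)
```

Keeping the link table separate lets the same company appear in several judgments without being duplicated, which matters because company names recur across judgments.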
19. Credits
Thanks to:
Fabrizio Alboni
Daniela Arlia
Antonella Baldassarini
Lorenzo Bartalini
Pietro Battiston
Sergio Bolasco
Alberto di Martino
Giuseppe Di Vetta
Pasquale Pavone
Guido M. Rey