Presented at the SIKS Smart Auditing Workshop, 25 Feb 2015.
Governmental organizations responsible for keeping certain types of fraud under control often use data-driven methods, both for immediate detection of fraud and for fraud risk analysis aimed at targeting inspections more effectively. A blind spot in such methods is that the source data often represents a 'paper reality': fraudsters will attempt to disguise themselves in the data they supply, painting a world in which they do nothing wrong. This blind spot can be counteracted by enriching the data with traces and indicators from more 'real-world' sources such as social media and the internet. One of the crucial data management problems in accomplishing this enrichment is how to capture and handle data quality problems. The presentation starts with a real-world example, which also serves as the starting point for generalizing the problem in terms of information combination and enrichment (ICE). We then present the ICE technology and show how data quality problems can be managed with probabilistic databases. In terms of the 4 V's of big data (volume, velocity, variety and veracity), this presentation focuses on the third and fourth V's: variety and veracity.
Dealing with poor data quality of osint data in fraud risk analysis
1. DEALING WITH POOR DATA QUALITY OF
OSINT DATA IN FRAUD RISK ANALYSIS
MAURICE VAN KEULEN
2. Largest part [of money to reclaim] is due to payments to people who were not entitled to it. … earlier, it didn’t pay off to reclaim the money. [Telegraaf Jan 2012]
Capelle a/d IJssel 2011: 164 cases of fraud yielded 1.2 million
25 Feb 2015 | Dealing with poor data quality of OSINT data in fraud risk analysis
3. “If a strong suspicion of fraud arises, social inspectors
start an investigation with the receiver of social security”
HOW DOES DIGITAL FRAUD DETECTION WORK?
(IN CASE OF SOCIAL SECURITY FRAUD)
Application
• Data from applicant
Coupling
• Data from governmental databases
Fraud risk analysis
• Extraction of “indicators”
• Data mining: classification
Investigation
• Selection of cases from risk classes
Municipalities are responsible for fraud detection. The ISZW inspectorate (a department of the Ministry) assists them with training the classifiers.
4. Doesn’t work as well as expected
• Estimation of fraud risk not accurate enough
Main cause: the data represents a “paper reality”
Solution: Enrich data with other independent ‘data traces’
Independent indicators closer to the real world
Discrepancy indicators
Where can data traces from the real world also be found?
• Websites, social media
Open Source Intelligence (OSINT)
THE “BLIND SPOT”
5. Auditing (Unit-4)
• Fraudsters will disguise illegitimate transactions by keeping them “out of the books”
• If you look only in the books, you find nothing missing
• Solution = Find indications of missing transactions (involved people, goods, money) … all these leave data traces somewhere …
Asbestos removal (ISZW)
• Fewer obligatory protection measures in exchange for a lower price
• Official price vs. advertised price
• Bad experiences or suspicions mentioned in web forums
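A discrepancy indicator such as "official price vs. advertised price" can be sketched in a few lines. This is only an illustration under stated assumptions: the record fields, the function names and the 25% threshold are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class PriceTrace:
    """One company with two independent price observations (hypothetical fields)."""
    company: str
    official_price: float    # price according to the official records
    advertised_price: float  # price harvested from the company's own website


def price_discrepancy(trace: PriceTrace) -> float:
    """Relative gap between official and advertised price (0.0 means no gap)."""
    if trace.official_price <= 0:
        raise ValueError("official_price must be positive")
    return (trace.official_price - trace.advertised_price) / trace.official_price


def is_suspicious(trace: PriceTrace, threshold: float = 0.25) -> bool:
    """Flag a case when the advertised price undercuts the official one by
    at least `threshold` (an arbitrary illustrative cut-off)."""
    return price_discrepancy(trace) >= threshold


if __name__ == "__main__":
    example = PriceTrace("ExampleCo", official_price=1000.0, advertised_price=600.0)
    print(price_discrepancy(example), is_suspicious(example))
```

The indicator only says that two data traces disagree; whether that points to fraud is left to the risk analysis and the inspectors.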
OTHER EXAMPLES
6. Enriched data
Databases / Knowledge bases
Information on websites
Text fragments from social media
INFORMATION COMBINATION AND ENRICHMENT (ICE)
Web harvesting
• Search
• Navigate
• Extract
• Store
Information extraction
(NLP / IR)
• Entity extraction
• Entity disambiguation
• Entity relationships
• Fact extraction
• Sentiment / class
Better indicators
Better risk analysis
Better fraud detection
7. WEB HARVESTING
[Screenshot: Sony Japan (www.sony.co.jp) site search results for "video", showing results 1-10 of about 11,903: Japanese-language press-release links and snippets, product keywords (Walkman, VAIO, FeliCa, PS4, Xperia, …) and page navigation]
Computers don’t understand:
• Layout of a page
• Meaning of text fragments
• The entities & facts we’re looking for
Advertising & visual techniques are very confusing!
Errors in extracted data
8. What is the name of the hotel?
“Essex House Hotel and Suites from $154 USD”
Where is the hotel located? (There are more than 60 places named Paris in the world.)
“This Hilton hotel in Paris looks soooo nice;))”
Informal language
“Cancun is a MUST! Check this... Hotel Ocean Spa Cancun 4d 3N w/2 adults from $199 usd”
INFORMATION EXTRACTION FROM UNSTRUCTURED TEXT
CHALLENGING TASK BECAUSE COMPUTERS CAN’T READ
• Extraction ambiguity
• Structure ambiguity
• Reference ambiguity
Errors in extracted data
9. COMBINING DATA
van Keulen, M. (2012) Managing Uncertainty: The Road Towards Better Data Interoperability. IT - Information Technology, 54 (3), pp. 138-146. ISSN 1611-2776
Source tables:

Car brand                  Sales
B.M.W.                     25
Mercedes                   32
Renault                    10

Car brand                  Sales
BMW                        72
Mercedes-Benz              39
Renault                    20

Car brand                  Sales
Bayerische Motoren Werke   8
Mercedes                   35
Renault                    15

Combined table:

Car brand                  Sales
B.M.W.                     25
Bayerische Motoren Werke   8
BMW                        72
Mercedes                   67
Mercedes-Benz              39
Renault                    45
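The combination on this slide can be reproduced with a short Python sketch. It assumes that the three source tables are merged by summing sales per exact brand string; the variable and function names are illustrative and not taken from the talk.

```python
from collections import Counter

# The three source tables from the slide, keyed by the brand string as written.
source_1 = {"B.M.W.": 25, "Mercedes": 32, "Renault": 10}
source_2 = {"BMW": 72, "Mercedes-Benz": 39, "Renault": 20}
source_3 = {"Bayerische Motoren Werke": 8, "Mercedes": 35, "Renault": 15}


def combine(*tables):
    """Sum sales per brand string; different spellings stay separate rows."""
    total = Counter()
    for table in tables:
        total.update(table)  # Counter.update adds the counts per key
    return dict(sorted(total.items()))


combined = combine(source_1, source_2, source_3)

if __name__ == "__main__":
    for brand, sales in combined.items():
        print(f"{brand:<26}{sales}")
```

Because matching is done on the literal string, "B.M.W.", "BMW" and "Bayerische Motoren Werke" end up as three rows, which is exactly the semantic-duplicate problem the talk goes on to discuss.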
10. … AND THE PROBLEM OF SEMANTIC DUPLICATES
Car brand                  Sales
B.M.W.                     25
Bayerische Motoren Werke   8
BMW                        72
Mercedes                   67
Mercedes-Benz              39
Renault                    45

Preferred customers …
SELECT SUM(Sales)
FROM CarSales
WHERE Sales>100
Result: 0 (‘No preferred customers’)
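The effect of semantic duplicates on this query can be tried out with Python's built-in sqlite3 module. The sketch below is an illustration only: the table layout follows the slide, while the `Canon` mapping table is a hypothetical hand-made deduplication that simply merges every suspected duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CarSales (brand TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO CarSales VALUES (?, ?)",
    [("B.M.W.", 25), ("Bayerische Motoren Werke", 8), ("BMW", 72),
     ("Mercedes", 67), ("Mercedes-Benz", 39), ("Renault", 45)],
)

# SUM over zero rows yields NULL in SQL; COALESCE turns it into the 0 on the slide.
naive_sql = "SELECT COALESCE(SUM(Sales), 0) FROM CarSales WHERE Sales > 100"
naive_total = conn.execute(naive_sql).fetchone()[0]

# Hypothetical mapping that treats all spelling variants as the same brand.
conn.execute("CREATE TABLE Canon (brand TEXT, canonical TEXT)")
conn.executemany(
    "INSERT INTO Canon VALUES (?, ?)",
    [("B.M.W.", "BMW"), ("Bayerische Motoren Werke", "BMW"), ("BMW", "BMW"),
     ("Mercedes", "Mercedes"), ("Mercedes-Benz", "Mercedes"),
     ("Renault", "Renault")],
)
merged_sql = """
SELECT COALESCE(SUM(total), 0) FROM (
    SELECT SUM(Sales) AS total
    FROM CarSales JOIN Canon USING (brand)
    GROUP BY canonical
) WHERE total > 100
"""
merged_total = conn.execute(merged_sql).fetchone()[0]

if __name__ == "__main__":
    print(naive_total, merged_total)
```

With the duplicates left in place no brand exceeds 100; with all of them merged, BMW (105) and Mercedes (106) both qualify. Which of the two readings is right is precisely what is uncertain.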
11. Finding You on the Internet
Input: name, address(es), phone number(s), email address(es)
How to find your online accounts (Twitter, eBay, Facebook, RunKeeper, …)
BACK TO FRAUD DETECTION
Person Pipeline
• Components: ByNameFinder, ByLocationFinder, KnownAccountEnumerator, other, PersonUpdater
• Data: Persons, Person data
Account Pipeline
• Components: ProfileExtractor, PhotoExtractor, MsgExtractor, AccountPersister, other
• Data: Twitter accounts, Account data
Message Pipeline
• Components: EmailExtractor, PhoneExtractor, LanguageExtractor, other, MsgPersister
• Data: Message data, attributes
Results: Candidate accounts, Additional info found
Experiment:
• 22 signed-up subjects
• 12 with / 10 without
• 15 iterations
• Avg 200 candidates
• 11 out of 12 found
• ISZW: 85 subjects
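As an illustration of what one stage in the message pipeline might do, here is a minimal sketch of an e-mail extractor and of how extracted attributes could be compared with what is already known about a person. Only the component name EmailExtractor appears on the slide; the regular expression, the function names and the matching logic are assumptions made for this example.

```python
import re

# Deliberately simple pattern; real-world e-mail syntax is more permissive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(message):
    """Return all e-mail-like strings found in one message, in order."""
    return EMAIL_RE.findall(message)


def matching_attributes(known_emails, messages):
    """Known e-mail addresses of a person that also occur in the messages
    of a candidate account (case-insensitive comparison)."""
    known = {e.lower() for e in known_emails}
    found = {e.lower() for m in messages for e in extract_emails(m)}
    return known & found


if __name__ == "__main__":
    msgs = ["Mail me at Jan@Example.org!", "nothing to see here"]
    print(matching_attributes(["jan@example.org"], msgs))
```

A non-empty overlap would count as evidence that a candidate account belongs to the person; it is not proof, which is why the talk treats such matches as uncertain data.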
12. • All activities involved in coupling and integration of information systems
Data exchange, conversion, information extraction, integration, analysis, cleaning, evolution, migration, etc.
• Focus: “in an imperfect world”
Structural heterogeneity, data conflicts, semantic duplicates, incompleteness, inexactness, ambiguity, errors, etc.
• Clean correct data is only a special case
• Treat data quality problems as a fact of life, not as something to be repaired afterwards
RESEARCH FOCUS: DATA INTEROPERABILITY
13. MOST DATA QUALITY PROBLEMS CAN BE MODELED AS UNCERTAINTY IN DATA
Example: semantic duplicates

#   Car brand                  Sales   Condition
1   B.M.W.                     25
2   Bayerische Motoren Werke   8
3   BMW                        72
4   Mercedes                   67      X=0
5   Mercedes-Benz              39      X=0
6   Renault                    45
    Mercedes                   106     X=1 Y=0
    Mercedes-Benz              106     X=1 Y=1

Random variables:
X=0   4 and 5 different                 0.2
X=1   4 and 5 the same                  0.8
Y=0   “Mercedes” correct name           0.5
Y=1   “Mercedes-Benz” correct name      0.5

B.M.W. / BMW / Bayerische Motoren Werke analogously
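The random variables X and Y on this slide define a small set of possible worlds for the Mercedes rows. The sketch below enumerates them; it assumes that X and Y are independent, which the slide does not state explicitly, and all function names are illustrative.

```python
from itertools import product

# Probabilities as given on the slide.
P_X = {0: 0.2, 1: 0.8}  # X=1: rows 4 and 5 denote the same brand
P_Y = {0: 0.5, 1: 0.5}  # Y=0: "Mercedes" is the correct name, Y=1: "Mercedes-Benz"


def mercedes_rows(x, y):
    """Rows that exist in the world identified by the assignment (x, y)."""
    if x == 0:
        # Different brands: both original rows exist, Y is irrelevant.
        return (("Mercedes", 67), ("Mercedes-Benz", 39))
    name = "Mercedes" if y == 0 else "Mercedes-Benz"
    return ((name, 67 + 39),)


def possible_worlds():
    """Map each distinct set of rows to its total probability
    (assumes X and Y are independent)."""
    worlds = {}
    for (x, px), (y, py) in product(P_X.items(), P_Y.items()):
        rows = mercedes_rows(x, y)
        worlds[rows] = worlds.get(rows, 0.0) + px * py
    return worlds


if __name__ == "__main__":
    for rows, prob in possible_worlds().items():
        print(f"{prob:.2f}  {rows}")
```

Under these assumptions there are three distinct worlds: the two rows kept apart (0.2), a single "Mercedes" row with 106 (0.4) and a single "Mercedes-Benz" row with 106 (0.4).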
14. Looks like an ordinary database
Several “possible” answers or approximate answers to queries
What I showed is discrete uncertainty only; continuous uncertainty is also possible
Uncertainty is orthogonal to the data model: Relational (SQL) / XML (XPath) / RDF (SPARQL) / Reasoning (DataLog)
Important: Scalability (big data!)
IMPORTANT TOOL: PROBABILISTIC DATABASE
15. Sales of “preferred customers”
SELECT SUM(sales)
FROM carsales
WHERE sales >= 100
Answer: 106 (the most likely answer)
The analyst is only bothered with problems that matter
Risk = Probability * Impact
INDETERMINISTIC DEDUPLICATION
QUERYING AND RISK ASSESSMENT
SUM(sales)   P
0            14%
105          6%
106          56%
211          24%
Risk of a substantially wrong answer: the second most likely answer, at 24%, has an impact factor of 2 in sales (211 vs 106)
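The answer distribution in this table can be recomputed by enumerating possible worlds. The sketch below takes the probability 0.8 that the two Mercedes rows are the same from the earlier slide; the probability 0.3 that all three BMW spellings denote one brand is an assumption inferred from the table (it is the value that reproduces the four percentages), and independence of the two decisions is assumed as well. Partial BMW merges are ignored because none of them reaches the threshold of 100.

```python
from collections import defaultdict
from itertools import product

P_MERCEDES_SAME = 0.8  # from the slide on semantic duplicates
P_BMW_SAME = 0.3       # assumption: inferred from the result table, not stated


def sum_preferred(rows, threshold=100):
    """SELECT SUM(sales) ... WHERE sales >= threshold, evaluated in one world."""
    return sum(sales for _, sales in rows if sales >= threshold)


def answer_distribution():
    """Probability of each possible query answer over all possible worlds."""
    dist = defaultdict(float)
    for bmw_same, merc_same in product([False, True], repeat=2):
        p_bmw = P_BMW_SAME if bmw_same else 1 - P_BMW_SAME
        p_merc = P_MERCEDES_SAME if merc_same else 1 - P_MERCEDES_SAME
        rows = [("Renault", 45)]
        rows += ([("BMW", 105)] if bmw_same else
                 [("B.M.W.", 25), ("Bayerische Motoren Werke", 8), ("BMW", 72)])
        rows += ([("Mercedes", 106)] if merc_same else
                 [("Mercedes", 67), ("Mercedes-Benz", 39)])
        dist[sum_preferred(rows)] += p_bmw * p_merc
    return dict(dist)


if __name__ == "__main__":
    for answer, prob in sorted(answer_distribution().items()):
        print(f"{answer:>4}  {prob:.0%}")
```

Read this way, the analyst sees at a glance that 106 is the most likely answer but that there is a 24% chance the true figure is about twice as large.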
16. Web harvesting: layout/navigation/extraction ambiguity
Possible values with probabilities and dependencies
Information extraction: extraction/structure/reference ambiguity
Possible values with probabilities and dependencies
Candidate accounts in finding you on the internet
Possible (PersonID,AccID) pairs with probabilities
Associated extracted data with dependencies
Combining / coupling all this data
Just more possibilities and dependencies
Extraction of indicators = querying
Probabilistic indicators: Possible values with probabilities
Risk analysis and data mining
It’s just statistics; they can easily work with probabilistic data
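To make "it's just statistics" concrete: a probabilistic indicator is a set of possible values with probabilities, and standard summaries can be computed from it before it is passed to a risk model. The sketch below is one possible way to do this, not the method used in the talk; the function names are illustrative, and the example distribution is the SUM(sales) table from the earlier slide.

```python
def check_distribution(dist, tol=1e-9):
    """Raise if the probabilities are negative or do not sum to 1."""
    if any(p < 0 for p in dist.values()) or abs(sum(dist.values()) - 1.0) > tol:
        raise ValueError("not a probability distribution")


def expected_value(dist):
    """Probability-weighted mean of a numeric probabilistic indicator."""
    check_distribution(dist)
    return sum(value * p for value, p in dist.items())


def prob_at_least(dist, threshold):
    """Probability that the indicator reaches the threshold."""
    check_distribution(dist)
    return sum(p for value, p in dist.items() if value >= threshold)


# Possible values of SUM(sales) with their probabilities.
sum_sales = {0: 0.14, 105: 0.06, 106: 0.56, 211: 0.24}

if __name__ == "__main__":
    print(expected_value(sum_sales), prob_at_least(sum_sales, 100))
```

A classifier could consume such summaries as ordinary numeric features, or it could be trained on samples drawn from the distribution; which option fits best depends on the risk model.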
PROBABILISTIC DATABASES IN FRAUD DETECTION
17. PUTTING IT ALL TOGETHER
Person/Company data
Web / Social media
Probabilistic Database
OSINT harvester
Interpretation
Combination
Indicator extraction
Fraud Risk Analysis
Raw Evidence
Make data quality and trust issues explicit as uncertainty in data
Adapted to probabilistic indicators
Batch-wise autonomous harvesting / monitoring
18. Although the data is public, one cannot use it for just anything!
Cooperation with ethicist: Aimee van Wynsberghe
Generic guidelines for working with social network data
van Wynsberghe, A., Been, H. and van Keulen, M. (2013) To use or not to use: guidelines for researchers using data from online social networking sites
Value trade-off
People investigated
People whose account is false positive
The ISZW
All Dutch citizens
INTERMEZZO ON ETHICS
19. OSINT is an additional data source with traces close to the real world … but it is hard to extract and produces lower-quality data
OSINT requires more automation, autonomy and robustness
Modeling data quality problems as uncertainty in data
Probabilistic database approach for scalability
In terms of the V’s of Big Data
Volume
Velocity
Variety
Veracity
CONCLUSIONS
Variety and Veracity: my main object of study (while not forgetting about the other two)
Editor's notes
The asbestos case is interesting: protection measures are not observable in the data. Inspection will work, but here we would like to optimize the inspections.
They already do this for “dossier analysis” on an individual basis.
With OSINT data, this problem of semantic duplicates is enormous.
Also an illustration of a web harvesting set-up.
This is only Twitter, but if done with more social media accounts, a careless tweet may help to resolve, say, a Facebook account.
Notice that all these are “tables”
Isn’t this nice: all these data quality problems are now in one form, readily usable.
Many data quality problems need not even be solved!
Although requesting welfare support is not really a matter of choice, the receiver is not obliged to do so. By requesting welfare support, someone therefore voluntarily gives up some privacy, allowing the government to investigate whether they do so rightfully.