ISWC 2012 - Industry Track - Linked Enterprise Data: leveraging the Semantic Web stack in a corporate IS environment.
1. Linked Enterprise Data
LEVERAGING THE SEMANTIC WEB STACK
IN A CORPORATE ENVIRONMENT
ISWC 2012 – BOSTON
FABRICE LACROIX – LACROIX@ANTIDOT.NET
Copyright Antidot™
2. Antidot – who we are
French-based Software Vendor
Since 1999 | Paris, Lyon, Aix-en-Provence
Information access | Data management
Mission: Provide our customers with innovative
customizable solutions that help them create
value with their data, and make their employees
more aware and efficient.
5. Structured data
CRM, ERP, directory
knowledge bases
business applications (production, support)
6. IS are bloated
1 practice => 1 need => 1 application => 1 silo
The information system is driven by processes
Data are numerous, varied and scattered
8. Solutions and workarounds
Enterprise Search brings little value to users
Document oriented
Does not solve real business problems
Google-like, Verity-like
10. What we want
(Diagram of data silos: ERP, CRM, Production, LDAP, ECM, Support, Files)
11. Changing the paradigm
Switching from an application view to a data-centric way of thinking.
12. Bring out the implicit
Build the Giant Enterprise Graph
13. LED
Linked Enterprise Data
the application of Semantic Web technologies and Linked Data principles to the enterprise infrastructure
14. What works for the Web…
Federating silos on the Web
http://www.w3.org/People/Ivan/CorePresentations/RDFTutorial/Slides.html#(102)
15. …can’t always be used in corporate IS
Legacy apps can’t be "SPARQL’ed"
80% un- or semi-structured data don’t fit the model as such
Defining vocabularies/ontologies for silos is too complex and expensive
Enterprises don’t want RDF per se, but valuable information
External data is available in XML/JSON through Web Services
Staff are trained for RDBs, XML, Web apps
No-risk/stability strategy: SemWeb technology considered new and immature
16. The RDF/storage approach
Setting up a global RDF repository does not work either
IT departments are wary of the "RDF everywhere" activists
17. Semantic Web technology
is still the right solution in a corporate environment
BUT it is not an aim
JUST use it
as a means
18. Just do it
Think of it as a stream paradigm
build new objects using existing data
without interfering with the existing infrastructure
with SemWeb somewhere under the hood
19. Enterprise Graph HowTo
Construct the graph
generate triples from data
create triples from documents
Leverage the graph
enrich
infer
Browse the graph
select resources
build objects
Trash the graph
20. How: extract & normalize
Harvest and normalize
as in an ETL: fetch, clean, transform…
normalize records (names, IDs) to prepare the linking step
For databases
db2triples: an RDB2RDF implementation by Antidot (open source, W3C validated)
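The normalization step can be sketched in a few lines of Python. The field names and cleanup rules below are purely illustrative (this is not db2triples' actual behavior); the point is that records from different silos only link once their names and IDs agree:

```python
import re

def normalize_record(record):
    """Illustrative normalization pass: trim and case-fold names,
    canonicalize IDs so records from different silos can be linked."""
    out = dict(record)
    if "name" in out:
        # collapse whitespace and normalize casing for matching
        out["name"] = re.sub(r"\s+", " ", out["name"]).strip().title()
    if "id" in out:
        # strip separators so "FR-00123" and "fr00123" align
        out["id"] = re.sub(r"[^0-9a-z]", "", out["id"].lower())
    return out

# the same company as seen by a CRM and an ERP
crm = normalize_record({"name": "  acme   corp ", "id": "FR-00123"})
erp = normalize_record({"name": "Acme Corp", "id": "fr00123"})
assert crm == erp  # now they can be meshed
```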
21. How: semantize
Don’t transform everything into RDF
cherry-pick a subset of interesting fields for each object and create their RDF triple counterparts
interesting == needed for linking or inferring
22. How: semantize
Triples generation
Be smart: avoid upfront ontology design, use small vocabularies
Be pragmatic: transform XML tags and field names to predicates
Be agile: only insert what you need. And when you need more, add more.
Semantic Web fuels the modeling, linking and information building process
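A minimal sketch of this pragmatic triple generation, with tuples standing in for a triplestore; the base URI, record fields and `keep` list are hypothetical examples, not an actual vocabulary:

```python
# Turn selected fields of a source record into triples, deriving
# predicates directly from the field names ("be pragmatic").
BASE = "http://example.org/"

def semantize(record, keep):
    """Emit (subject, predicate, object) triples for the chosen fields only."""
    subject = BASE + record["type"] + "/" + record["id"]
    triples = []
    for field in keep:                          # cherry-picked subset
        if field in record:
            predicate = BASE + "vocab#" + field  # field name -> predicate
            triples.append((subject, predicate, record[field]))
    return triples

invoice = {"type": "invoice", "id": "42", "client": "ACME",
           "total": "1200", "internal_note": "draft"}
# only the fields needed for linking or inferring become RDF
triples = semantize(invoice, keep=["client", "total"])
assert len(triples) == 2
```

The fields left out (here `internal_note`) are kept aside rather than loaded into the graph, which matches the later Build step.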
23. Enterprise Graph HowTo
(outline recap: Construct the graph → Leverage the graph → Browse the graph → Trash the graph)
24. How: semantize
Unstructured documents
Extract metadata and transform it as needed into RDF
➡ Ex: author => dc:creator
Use text mining to extract named entities: people, organizations, products…
➡ generate those entity lists using the data sources: directory for employees, CRM for companies and people, ERP for products
➡ create triples like doc_URI quotes entity_URI
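A toy version of this annotation step: the entity list comes from the silos (directory, CRM, ERP), and each mention yields a `quotes` triple. All URIs and the substring-matching "text mining" are illustrative placeholders for the real extraction technology:

```python
# label -> URI, as it would be built from the enterprise sources
ENTITIES = {
    "John Smith": "http://example.org/people/jsmith",      # directory
    "ACME": "http://example.org/companies/acme",           # CRM
    "WidgetPro": "http://example.org/products/widgetpro",  # ERP
}
QUOTES = "http://example.org/vocab#quotes"

def annotate(doc_uri, text):
    """Naive entity spotting; real systems use proper text mining."""
    return [(doc_uri, QUOTES, uri)
            for label, uri in ENTITIES.items() if label in text]

triples = annotate("http://example.org/docs/report1",
                   "John Smith sent the WidgetPro specs to ACME.")
assert len(triples) == 3
```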
25. How: semantize
Unstructured documents
Compare documents using various dedicated algorithms
➡ is the same
➡ is included
➡ is similar
➡ is related
Generate new triples
➡ create triples like <docA> is_sub_version_of <docB>
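As a sketch of the pairwise comparison, Jaccard similarity over word sets stands in for the dedicated algorithms; the threshold and the `is_similar_to` predicate are illustrative:

```python
SIMILAR = "http://example.org/vocab#is_similar_to"

def jaccard(a, b):
    """Similarity of two texts as overlap of their word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def compare(docs, threshold=0.5):
    """Yield one triple per sufficiently similar document pair."""
    uris = list(docs)
    triples = []
    for i, u in enumerate(uris):
        for v in uris[i + 1:]:
            if jaccard(docs[u], docs[v]) >= threshold:
                triples.append((u, SIMILAR, v))
    return triples

docs = {"doc:A": "quarterly sales report draft",
        "doc:B": "quarterly sales report final",
        "doc:C": "holiday party menu"}
assert compare(docs) == [("doc:A", SIMILAR, "doc:B")]
```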
26. Enterprise Graph HowTo
(outline recap: Construct the graph → Leverage the graph → Browse the graph → Trash the graph)
27. How: enrich
Enrich the graph
run specific algorithms to generate more links and triples (classifiers, topic detection, …)
insert external data gathered from the LOD or other external datasets or APIs
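Since external data mostly arrives as JSON from Web Services (as noted earlier in the talk), enrichment can be sketched as mapping selected JSON fields to new triples about an existing resource. The canned response, URIs and predicates below are hypothetical stand-ins for a live API call:

```python
import json

# canned Web Service response standing in for a real API call
response = json.loads(
    '{"company": "ACME", "sector": "Manufacturing", "city": "Lyon"}')

def enrich(subject_uri, record, fields):
    """Map selected JSON fields to triples about an existing resource."""
    return [(subject_uri, "http://example.org/vocab#" + k, record[k])
            for k in fields if k in record]

triples = enrich("http://example.org/companies/acme", response,
                 ["sector", "city"])
assert len(triples) == 2
```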
28. How: infer
Create new knowledge
add rules according to your needs
IF a coworker is quoted in documents
AND this coworker belongs to a business unit
THEN the business unit is bound to the documents
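The rule above can be read as a simple join over the graph. A minimal sketch, with tuples as triples and illustrative predicate names (`quotes`, `member_of`, `bound_to`):

```python
QUOTES, MEMBER_OF, BOUND_TO = "quotes", "member_of", "bound_to"

def infer_bound_units(graph):
    """IF doc quotes person AND person member_of unit
    THEN doc bound_to unit."""
    new = []
    for (doc, p1, person) in graph:
        if p1 != QUOTES:
            continue
        for (s, p2, unit) in graph:
            if p2 == MEMBER_OF and s == person:
                new.append((doc, BOUND_TO, unit))
    return new

graph = [("doc:1", QUOTES, "emp:alice"),
         ("emp:alice", MEMBER_OF, "bu:sales")]
assert infer_bound_units(graph) == [("doc:1", BOUND_TO, "bu:sales")]
```

In a real triplestore the same join would typically be a SPARQL CONSTRUCT query or a forward-chaining rule.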
29. Enterprise Graph HowTo
(outline recap: Construct the graph → Leverage the graph → Browse the graph → Trash the graph)
30. How: build
Build
select resources corresponding to object seeds (using SPARQL queries)
for each seed, follow links smartly in order to create basic objects
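The two build sub-steps can be sketched as seed selection followed by a bounded link traversal. In the real pipeline the selection is a SPARQL query; here a filter over tuples stands in for it, and the predicates and depth limit are illustrative:

```python
def select_seeds(graph, rdf_type):
    """Seed selection: all resources of a given type
    (a SPARQL SELECT in the real pipeline)."""
    return {s for (s, p, o) in graph if p == "rdf:type" and o == rdf_type}

def build_object(graph, seed, depth=2):
    """Follow outgoing links from the seed, up to `depth` hops."""
    obj, frontier = {}, {seed}
    for _ in range(depth):
        nxt = set()
        for (s, p, o) in graph:
            if s in frontier:
                obj.setdefault(s, []).append((p, o))
                nxt.add(o)
        frontier = nxt
    return obj

graph = [("inv:42", "rdf:type", "Invoice"),
         ("inv:42", "client", "co:acme"),
         ("co:acme", "name", "ACME Corp")]
seeds = select_seeds(graph, "Invoice")
obj = build_object(graph, next(iter(seeds)))
assert ("name", "ACME Corp") in obj["co:acme"]
```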
31. How: build
Finalize
decorate the new knowledge objects with the data set apart earlier (not loaded in the triplestore)
now we have rich, user-actionable objects
32. Enterprise Graph HowTo
(outline recap: Construct the graph → Leverage the graph → Browse the graph → Trash the graph)
33. How: expose
Make the new information available to
users and to the entire IS
(Pipeline diagram: Relational DB → Harvest → Normalize → Semantize → Classify / Annotate → Enrich → RDF Triplestore (Linked Data) → Indexation → AFS search engine)
34. Conclusion
It works!
The triples we create and the inference rules we add are dictated by the goal / application
➡ usage- and value-oriented
We benefit from the lazy-flexible-dynamic modeling of RDF-RDFS-OWL
➡ we are agile
What matters is the graph. But the graph is not the triplestore
➡ storage independent
35. There’s an app for that
Antidot Information Factory
a software solution designed specifically to:
leverage structured and unstructured data
enable large-scale processing of existing data
automate the publishing of enriched or newly created information
Harvest → Normalize → Semantize → Enrich → Build → Expose
36. The Giant Enterprise Graph
Now we have a path to let SemWeb enter
the enterprise
37. Discuss
Understand
Learn
Exchange
www.antidot.net
info@antidot.net
THANKS FOR YOUR ATTENTION
QUESTIONS?
Speaker notes
Our information system, like any other corporate IS, is blossoming with all types of information. Most of this information is UNstructured.
And part of it is structured, mostly due to the relational databases underlying business applications. These are the applications we run internally: CRM, ERP, support tracking, …
Many approaches have been developed to solve this problem of isolated silos. Most of them only apply to structured data (BI, MDM). And in most cases they entail a long and costly deployment process and make the system more complex.
Enterprise search is not a solution. And we know that for sure, since we are a leading vendor in the realm of search solutions. The problem is related to the very nature of current search engines: they are document oriented. They read documents, they index documents, they return documents.
This is what we want: agile information, meshed, merged, enriched.
What you see is not a data mashup! It is not just data put side by side. Some of the information you see here needs advanced processing that cannot be done on the fly.
The solution is to change the paradigm: forget the applications and the APIs. Just look at the data.
We need to create the Enterprise Graph.
There is a solution: one that has been thought out and designed for the Web. If it works for the Web, it should work for you and us.
The architecture for integrating data on the Web from various silos relies on a federated principle where a query is synchronously distributed over the sources through SPARQL endpoints exposed by each of them. This approach presents many scientific and technological challenges, but considering the rationale behind the Web of Data and the need to work in the gigantic open Web space, it seems to be the only reasonable way to make it work.
Though theoretically correct, this approach is not applicable to the corporate IS for a large variety of reasons:
• The corporate information system is built with numerous legacy or closed applications that cannot be adapted or extended with SPARQL endpoints.
• The enterprise information realm is made up at 80% of unstructured or semi-structured data that cannot fit in the model as such.
• Enterprises do not want access to raw data in RDF format. They want to reap valuable information derived from the data, which requires large and complex computations to create these new informational objects.
• The bottom-up approach of mapping silos and their data to RDF to fit the model requires an enormous amount of work for defining vocabularies or ontologies for each source, which is too heavy an investment.
• Companies dream of seamlessly integrating external data to leverage their internal information. But this external data is mostly available in XML or JSON through Web Services, and not yet in RDF, so using SPARQL as a way to query and integrate does not make sense.
• IT departments have invested heavily in their “relational database for storing / XML for exchanging / Web apps for accessing” infrastructure. Their staffs are trained for this paradigm. They lack in-house skills for integrating the graph way of thinking.
• Stability matters most, and Semantic Web technology is unknown, considered new and immature: CIOs are not ready to take the risk of adding load and technological uncertainty to systems that are critical to the company for its daily business operations.
It does not work because of process issues (modeling, know-how) and technology issues (performance, scalability). Enterprises don’t care about technology, especially a new one.
We tailor the Normalize process by aligning field contents in order to mesh data coming from different sources (such as records from a CRM and an ERP). The R2RML and Direct Mapping compliant module is named db2triples.
“Why do we transform only a subpart of the harvested data into RDF, and what do we do with the rest of it?” Indeed, and not to mention the fact that text documents are not graph-friendly: as stated above, we only transform a selected part of the structured data into RDF.
From a technical standpoint, we don’t feel the technology is mature and stable enough to proceed differently. In industrial projects, millions of seed objects are regularly extracted from the sources (invoices, clients, files, etc.), each having tens of fields. And having billions of triples doesn’t scale well in available triplestores.
Transforming only a subpart of the data largely simplifies the task of choosing the predicates, which reinforces the choice of using many small available vocabularies instead of big ontologies.
The data that is not transformed to RDF is stored by Information Factory for later use during the Build step.
Unstructured documents like office files, PDF files or email content don’t fit the RDF formalism and cannot be linked to the graph as such. Extra work is necessary. First, we transform available metadata (document name, author, creation date, sender and receivers for a mail, subject and so forth) into RDF. Then, we use text-mining technology to extract named entities like people, organizations, products, etc. from the documents. These entity lists are generated using different sources of the enterprise: directories, CRM or ERP provide people and company names, while products are listed in ERPs or taxonomies.
And last, we run various specific algorithms designed to do document-versus-document comparison to detect duplicates, different versions of the same document, inclusions, semantically related documents, etc. Each of these relations is inserted in the graph with an appropriate predicate.
It is like cooking: the rules are your own personal touch. Rules depend on the information and knowledge you want to create by inferring on the graph.
We created the graph by inserting basic triples. Then we grew the graph with enriching and inferring. Now it is time to extract the information we need. For this, we first select the resources we are looking for. Then we follow some links to grab the information and create basic objects.
We agree: we would all like to see those technologies invading the information system. We would like to put these stickers on this beautiful zSeries mainframe. But what does it mean? How can we do that?