SlideShare una empresa de Scribd logo
1 de 22
Poio API: a CLARIN-D curation project for
language documentation and language typology
Peter Bouda
Centro Interdisciplinar de Documentação Linguística e Social
pbouda@cidles.eu
Overview
● Existing infrastructure and workflows
● Poio API and CLASS within CLARIN
● GrAF and TEI
● Poio API
● GrAF as pivot structures (IGT)
● GrAF for retro-digitization (Dictionary)
Fieldwork
Fotos
Existing Infrastructure
LD tools and standards
● Elan: EAF, MPEG, WAV
● Toolbox: TXT, XML, WAV
● Arbil: IMDI/CIMDI („Component MetaData
Infrastructure“)
● Praat: XML, WAV
● ...
● No standards for tier hierarchies, tier names or
annotation schemes
● Efforts in ISOcat
Interlinear Glossed Text
CLARIN
GrAF
● GrAF: Graph Annotation Framework
● ISO 24612: Language resource management - Linguistic
annotation framework (LAF)
● Started as stand-off version of XCES
● API and representation as data structures, not a file format
● GrAF/XML as XML representation
● Used for the MASC of the ANC
● Nodes, edges, regions, annotations, feature structures
GrAF entities
GrAF structure
GrAF-XML
<node xml:id="words..W-Words..na23">
<link targets="words..W-Words..ra23"/>
</node>
<region anchors="780 1340" xml:id="words..W-Words..ra23"/>
<edge from="utterance..W-Spch..n8" to="words..W-Words..na23"
xml:id="ea23"/>
<a as="words" label="words" ref="words..W-Words..na23"
xml:id="a23">
<fs>
<f name="annotation_value">so</f>
</fs>
</a>
Why we use GrAF
● No inline markup
● Radical stand-off approach
– Easier to share and manage data
– Preferred solution to archive cultural heritage
– Ideal for sparse annotations
● Existing code: Java and Python
● API vs. XQuery
● The beauty of annotation graphs
Poio API
●
Think of GrAF as an assembly language for linguistic annotation; then
Poio API is a libray to map from and to higher-level languages
● Subset of GrAF to represent tier based annotation
– Interlinear glossed text (IGT)
● Filters and filter chains for search
● Plugin mechanism for file formats
– Mapping semantics: tiers and annotations to nodes and edges
● Meta-data for additional information (tier types etc.)
● Efforts to map between TEI and GrAF
– Poio API supports IGT, next step is dictionaries and lexica
– Retro-digitized dictionary data at University of Marburg are published as GrAF files
– We want to publish as TEI
A basic converter in Poio API
parser = poioapi.io.wikipedia_extractor.Parser("Wikipedia.xml")
writer = poioapi.io.graf.Writer()
converter = poioapi.io.graf.GrAFConverter(parser, writer)
converter.parse()
converter.write("Wikipedia.hdr")
A parser for CSV files
class CsvParser(poioapi.io.graf.BaseParser):
def get_root_tiers(self):
pass
def get_child_tiers_for_tier(self, tier):
pass
def get_annotations_for_tier(self, tier, annotation_parent=None):
pass
def tier_has_regions(self, tier):
pass
def region_for_annotation(self, annotation):
pass
def get_primary_data(self):
pass
Example: Analysis of CSV data
Example: Analysis of CSV data
http://nbviewer.ipython.org/urls/raw.github.com/pbouda/notebooks/master/Diana%2520Hinuq%25
Retro-digitization of dictionaries
● From scan to .doc to XML to DB to GrAF
● Radical stand-off approach for unsupervised
collaboration
● Dictionaries as cultural heritage texts
● GrAF as primary publication format
● Connectors to brat and TEI
Analysis of the data
● Spanish as pivot language, subset of bodypart terms
● Converting GrAF to networkx graph
● Nodes are heads, translations, etc.
● Head and translation connected via edges if they appear
in one entry
● Merge of graphs
● Count of paths of length 2 between spanish heads
● Python writes JSON graph, visualized with D3.js
D3 visualization
http://www.peterbouda.eu/bodyparts/index_bodyparts.html
Thank you for your attention!
pbouda@cidles.eu
Links
Clarin curation project:
http://de.clarin.eu/en/discipline-specific-working-groups/wg-3-linguistic-fieldwork-anthr
Poio:
http://media.cidles.eu/poio/
GrAF:
http://www.xces.org/ns/GrAF/1.0/

Más contenido relacionado

La actualidad más candente

A Context-Based Semantics for SPARQL Property Paths over the Web
A Context-Based Semantics for SPARQL Property Paths over the WebA Context-Based Semantics for SPARQL Property Paths over the Web
A Context-Based Semantics for SPARQL Property Paths over the WebOlaf Hartig
 
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force statusLDBC council
 
R Introduction
R IntroductionR Introduction
R Introductionschamber
 
1 R Tutorial Introduction
1 R Tutorial Introduction1 R Tutorial Introduction
1 R Tutorial IntroductionSakthi Dasans
 
Introducing The R Software
Introducing The R Software  Introducing The R Software
Introducing The R Software Kamarul Imran
 
Graph databases & data integration v2
Graph databases & data integration v2Graph databases & data integration v2
Graph databases & data integration v2Dimitris Kontokostas
 
What_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfWhat_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfHeiko Paulheim
 
R programming language: conceptual overview
R programming language: conceptual overviewR programming language: conceptual overview
R programming language: conceptual overviewMaxim Litvak
 
WISE2017 - Factorization Machines Leveraging Lightweight Linked Open Data-ena...
WISE2017 - Factorization Machines Leveraging Lightweight Linked Open Data-ena...WISE2017 - Factorization Machines Leveraging Lightweight Linked Open Data-ena...
WISE2017 - Factorization Machines Leveraging Lightweight Linked Open Data-ena...GUANGYUAN PIAO
 
R programming for data science
R programming for data scienceR programming for data science
R programming for data scienceSovello Hildebrand
 
Workshop presentation hands on r programming
Workshop presentation hands on r programmingWorkshop presentation hands on r programming
Workshop presentation hands on r programmingNimrita Koul
 
A short tutorial on r
A short tutorial on rA short tutorial on r
A short tutorial on rAshraf Uddin
 
Exposing relational database as rdf
Exposing relational database as rdfExposing relational database as rdf
Exposing relational database as rdfShakil Ahmed
 
R programming groundup-basic-section-i
R programming groundup-basic-section-iR programming groundup-basic-section-i
R programming groundup-basic-section-iDr. Awase Khirni Syed
 

La actualidad más candente (20)

R programming
R programmingR programming
R programming
 
A Context-Based Semantics for SPARQL Property Paths over the Web
A Context-Based Semantics for SPARQL Property Paths over the WebA Context-Based Semantics for SPARQL Property Paths over the Web
A Context-Based Semantics for SPARQL Property Paths over the Web
 
Introduction to R software, by Leire ibaibarriaga
Introduction to R software, by Leire ibaibarriaga Introduction to R software, by Leire ibaibarriaga
Introduction to R software, by Leire ibaibarriaga
 
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
 
R Programming
R ProgrammingR Programming
R Programming
 
R Introduction
R IntroductionR Introduction
R Introduction
 
1 R Tutorial Introduction
1 R Tutorial Introduction1 R Tutorial Introduction
1 R Tutorial Introduction
 
Introducing The R Software
Introducing The R Software  Introducing The R Software
Introducing The R Software
 
Graph databases & data integration v2
Graph databases & data integration v2Graph databases & data integration v2
Graph databases & data integration v2
 
What_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfWhat_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdf
 
R programming language: conceptual overview
R programming language: conceptual overviewR programming language: conceptual overview
R programming language: conceptual overview
 
WISE2017 - Factorization Machines Leveraging Lightweight Linked Open Data-ena...
WISE2017 - Factorization Machines Leveraging Lightweight Linked Open Data-ena...WISE2017 - Factorization Machines Leveraging Lightweight Linked Open Data-ena...
WISE2017 - Factorization Machines Leveraging Lightweight Linked Open Data-ena...
 
R programming for data science
R programming for data scienceR programming for data science
R programming for data science
 
R programming
R programmingR programming
R programming
 
Poster
PosterPoster
Poster
 
Workshop presentation hands on r programming
Workshop presentation hands on r programmingWorkshop presentation hands on r programming
Workshop presentation hands on r programming
 
A short tutorial on r
A short tutorial on rA short tutorial on r
A short tutorial on r
 
Exposing relational database as rdf
Exposing relational database as rdfExposing relational database as rdf
Exposing relational database as rdf
 
R crash course
R crash courseR crash course
R crash course
 
R programming groundup-basic-section-i
R programming groundup-basic-section-iR programming groundup-basic-section-i
R programming groundup-basic-section-i
 

Destacado

Best episode ever: Angular 2 from the perspective of an Angular 1 developer
Best episode ever: Angular 2 from the perspective of an Angular 1 developerBest episode ever: Angular 2 from the perspective of an Angular 1 developer
Best episode ever: Angular 2 from the perspective of an Angular 1 developerPeter Bouda
 
Smart Pen Presentation
Smart Pen PresentationSmart Pen Presentation
Smart Pen Presentationsusanvo_lavc
 
How community software supports language documentation and data analysis
How community software supports language documentation and data analysisHow community software supports language documentation and data analysis
How community software supports language documentation and data analysisPeter Bouda
 
Querying GrAF data in linguistic analysis
Querying GrAF data in linguistic analysisQuerying GrAF data in linguistic analysis
Querying GrAF data in linguistic analysisPeter Bouda
 
Poio API - An annotation framework to bridge Language Documentation and Natur...
Poio API - An annotation framework to bridge Language Documentation and Natur...Poio API - An annotation framework to bridge Language Documentation and Natur...
Poio API - An annotation framework to bridge Language Documentation and Natur...Peter Bouda
 
Transmision
TransmisionTransmision
Transmisionpailooot
 
Sci cafe humangenome&health
Sci cafe humangenome&healthSci cafe humangenome&health
Sci cafe humangenome&healthToby Rossman
 
Parker catalogue 2012
Parker catalogue 2012Parker catalogue 2012
Parker catalogue 2012PeterRamy
 
Product in theory and practice
Product in theory and practiceProduct in theory and practice
Product in theory and practiceRavi Chandegara
 
RxJS - The Reactive extensions for JavaScript
RxJS - The Reactive extensions for JavaScriptRxJS - The Reactive extensions for JavaScript
RxJS - The Reactive extensions for JavaScriptViliam Elischer
 
Data models in Angular 1 & 2
Data models in Angular 1 & 2Data models in Angular 1 & 2
Data models in Angular 1 & 2Adam Klein
 
Top Secret: Large-Scale SPA
Top Secret: Large-Scale SPATop Secret: Large-Scale SPA
Top Secret: Large-Scale SPAAnderson Braz
 
Cycling for noobs
Cycling for noobsCycling for noobs
Cycling for noobsSteve Lee
 
Development By The Numbers - ConFoo Edition
Development By The Numbers - ConFoo EditionDevelopment By The Numbers - ConFoo Edition
Development By The Numbers - ConFoo EditionAnthony Ferrara
 

Destacado (20)

Best episode ever: Angular 2 from the perspective of an Angular 1 developer
Best episode ever: Angular 2 from the perspective of an Angular 1 developerBest episode ever: Angular 2 from the perspective of an Angular 1 developer
Best episode ever: Angular 2 from the perspective of an Angular 1 developer
 
Smart Pen Presentation
Smart Pen PresentationSmart Pen Presentation
Smart Pen Presentation
 
Noord januari 2013
Noord januari 2013Noord januari 2013
Noord januari 2013
 
Multimiedia project
Multimiedia projectMultimiedia project
Multimiedia project
 
How community software supports language documentation and data analysis
How community software supports language documentation and data analysisHow community software supports language documentation and data analysis
How community software supports language documentation and data analysis
 
My Presentation
My PresentationMy Presentation
My Presentation
 
Querying GrAF data in linguistic analysis
Querying GrAF data in linguistic analysisQuerying GrAF data in linguistic analysis
Querying GrAF data in linguistic analysis
 
Poio API - An annotation framework to bridge Language Documentation and Natur...
Poio API - An annotation framework to bridge Language Documentation and Natur...Poio API - An annotation framework to bridge Language Documentation and Natur...
Poio API - An annotation framework to bridge Language Documentation and Natur...
 
Transmision
TransmisionTransmision
Transmision
 
Sci cafe humangenome&health
Sci cafe humangenome&healthSci cafe humangenome&health
Sci cafe humangenome&health
 
Parker catalogue 2012
Parker catalogue 2012Parker catalogue 2012
Parker catalogue 2012
 
Product in theory and practice
Product in theory and practiceProduct in theory and practice
Product in theory and practice
 
Pompa sentrifugal
Pompa sentrifugalPompa sentrifugal
Pompa sentrifugal
 
RxJS - The Reactive extensions for JavaScript
RxJS - The Reactive extensions for JavaScriptRxJS - The Reactive extensions for JavaScript
RxJS - The Reactive extensions for JavaScript
 
Data models in Angular 1 & 2
Data models in Angular 1 & 2Data models in Angular 1 & 2
Data models in Angular 1 & 2
 
Top Secret: Large-Scale SPA
Top Secret: Large-Scale SPATop Secret: Large-Scale SPA
Top Secret: Large-Scale SPA
 
Cycling for noobs
Cycling for noobsCycling for noobs
Cycling for noobs
 
01 - Git vs SVN
01 - Git vs SVN01 - Git vs SVN
01 - Git vs SVN
 
Simple testable code
Simple testable codeSimple testable code
Simple testable code
 
Development By The Numbers - ConFoo Edition
Development By The Numbers - ConFoo EditionDevelopment By The Numbers - ConFoo Edition
Development By The Numbers - ConFoo Edition
 

Similar a Poio API: a CLARIN-D curation project for language documentation and language typology

Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Map Reduce data types and formats
Map Reduce data types and formatsMap Reduce data types and formats
Map Reduce data types and formatsVigen Sahakyan
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop Neo4j
 
How to integrate python into a scala stack
How to integrate python into a scala stackHow to integrate python into a scala stack
How to integrate python into a scala stackFliptop
 
.NET 4 Demystified - Sandeep Joshi
.NET 4 Demystified - Sandeep Joshi.NET 4 Demystified - Sandeep Joshi
.NET 4 Demystified - Sandeep JoshiSpiffy
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Getting Started with PHP Extensions
Getting Started with PHP ExtensionsGetting Started with PHP Extensions
Getting Started with PHP ExtensionsMichaelBrunoLochemem
 
Building scalable and language-independent Java services using Apache Thrift ...
Building scalable and language-independent Java services using Apache Thrift ...Building scalable and language-independent Java services using Apache Thrift ...
Building scalable and language-independent Java services using Apache Thrift ...IndicThreads
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonRalf Gommers
 
Building scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thriftBuilding scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thriftTalentica Software
 
Using Aspects for Language Portability (SCAM 2010)
Using Aspects for Language Portability (SCAM 2010)Using Aspects for Language Portability (SCAM 2010)
Using Aspects for Language Portability (SCAM 2010)lennartkats
 
Introduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and PythonIntroduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and PythonJen Stirrup
 
Enforcing API Design Rules for High Quality Code Generation
Enforcing API Design Rules for High Quality Code GenerationEnforcing API Design Rules for High Quality Code Generation
Enforcing API Design Rules for High Quality Code GenerationTim Burks
 

Similar a Poio API: a CLARIN-D curation project for language documentation and language typology (20)

Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Map Reduce data types and formats
Map Reduce data types and formatsMap Reduce data types and formats
Map Reduce data types and formats
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
 
How to integrate python into a scala stack
How to integrate python into a scala stackHow to integrate python into a scala stack
How to integrate python into a scala stack
 
.NET 4 Demystified - Sandeep Joshi
.NET 4 Demystified - Sandeep Joshi.NET 4 Demystified - Sandeep Joshi
.NET 4 Demystified - Sandeep Joshi
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
DPFManager workshop
DPFManager workshopDPFManager workshop
DPFManager workshop
 
Getting Started with PHP Extensions
Getting Started with PHP ExtensionsGetting Started with PHP Extensions
Getting Started with PHP Extensions
 
Unit V.pdf
Unit V.pdfUnit V.pdf
Unit V.pdf
 
Building scalable and language-independent Java services using Apache Thrift ...
Building scalable and language-independent Java services using Apache Thrift ...Building scalable and language-independent Java services using Apache Thrift ...
Building scalable and language-independent Java services using Apache Thrift ...
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for Python
 
Php
PhpPhp
Php
 
Php
PhpPhp
Php
 
Php
PhpPhp
Php
 
Building scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thriftBuilding scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thrift
 
Using Aspects for Language Portability (SCAM 2010)
Using Aspects for Language Portability (SCAM 2010)Using Aspects for Language Portability (SCAM 2010)
Using Aspects for Language Portability (SCAM 2010)
 
Introduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and PythonIntroduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and Python
 
Enforcing API Design Rules for High Quality Code Generation
Enforcing API Design Rules for High Quality Code GenerationEnforcing API Design Rules for High Quality Code Generation
Enforcing API Design Rules for High Quality Code Generation
 

Último

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 

Último (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 

Poio API: a CLARIN-D curation project for language documentation and language typology

  • 1. Poio API: a CLARIN-D curation project for language documentation and language typology Peter Bouda Centro Interdisciplinar de Documentação Linguística e Social pbouda@cidles.eu
  • 2. Overview ● Existing infrastructure and workflows ● Poio API and CLASS within CLARIN ● GrAF and TEI ● Poio API ● GrAF as pivot structures (IGT) ● GrAF for retro-digitization (Dictionary)
  • 5. LD tools and standards ● Elan: EAF, MPEG, WAV ● Toolbox: TXT, XML, WAV ● Arbil: IMDI/CIMDI („Component MetaData Infrastructure“) ● Praat: XML, WAV ● ... ● No standards for tier hierarchies, tier names or annotation schemes ● Efforts in ISOcat
  • 8. GrAF ● GrAF: Graph Annotation Framework ● ISO 24612: Language resource management - Linguistic annotation framework (LAF) ● Started as stand-off version of XCES ● API and representation as data structures, not a file format ● GrAF/XML as XML representation ● Used for the MASC of the ANC ● Nodes, edges, regions, annotations, feature structures
  • 11. GrAF-XML <node xml:id="words..W-Words..na23"> <link targets="words..W-Words..ra23"/> </node> <region anchors="780 1340" xml:id="words..W-Words..ra23"/> <edge from="utterance..W-Spch..n8" to="words..W-Words..na23" xml:id="ea23"/> <a as="words" label="words" ref="words..W-Words..na23" xml:id="a23"> <fs> <f name="annotation_value">so</f> </fs> </a>
  • 12. Why we use GrAF ● No inline markup ● Radical stand-off approach – Easier to share and manage data – Preferred solution to archive cultural heritage – Ideal for sparse annotations ● Existing code: Java and Python ● API vs. XQuery ● The beauty of annotation graphs
  • 13. Poio API ● Think of GrAF as an assembly language for linguistic annotation; then Poio API is a libray to map from and to higher-level languages ● Subset of GrAF to represent tier based annotation – Interlinear glossed text (IGT) ● Filters and filter chains for search ● Plugin mechanism for file formats – Mapping semantics: tiers and annotations to nodes and edges ● Meta-data for additional information (tier types etc.) ● Efforts to map between TEI and GrAF – Poio API supports IGT, next step is dictionaries and lexica – Retro-digitized dictionary data at University of Marburg are published as GrAF files – We want to publish as TEI
  • 14. A basic converter in Poio API parser = poioapi.io.wikipedia_extractor.Parser("Wikipedia.xml") writer = poioapi.io.graf.Writer() converter = poioapi.io.graf.GrAFConverter(parser, writer) converter.parse() converter.write("Wikipedia.hdr")
  • 15. A parser for CSV files class CsvParser(poioapi.io.graf.BaseParser): def get_root_tiers(self): pass def get_child_tiers_for_tier(self, tier): pass def get_annotations_for_tier(self, tier, annotation_parent=None): pass def tier_has_regions(self, tier): pass def region_for_annotation(self, annotation): pass def get_primary_data(self): pass
  • 17. Example: Analysis of CSV data http://nbviewer.ipython.org/urls/raw.github.com/pbouda/notebooks/master/Diana%2520Hinuq%25
  • 18. Retro-digitization of dictionaries ● From scan to .doc to XML to DB to GrAF ● Radical stand-off approach for unsupervised collaboration ● Dictionaries as cultural heritage texts ● GrAF as primary publication format ● Connectors to brat and TEI
  • 19. Analysis of the data ● Spanish as pivot language, subset of bodypart terms ● Converting GrAF to networkx graph ● Nodes are heads, translations, etc. ● Head and translation connected via edges if they appear in one entry ● Merge of graphs ● Count of paths of length 2 between spanish heads ● Python writes JSON graph, visualized with D3.js
  • 21. Thank you for your attention! pbouda@cidles.eu