Transforming data silos into knowledge:
Early Chinese Periodicals Online (ECPO)
Matthias Arnold, Lena Hessel | Heidelberg | E-Science-Tage 2019 | 2019-03-29
Research data – Chinese periodical press
• First decades of the 20th century
• Understudied, although it dominated the contemporary print market and
provides access to the "actual culture" (R. Williams, 1961)
• Challenges:
• Physically dispersed, often poorly preserved
• Voluminous (full runs, daily, up to >30 years)
• Multi-generic and intellectually demanding
• Approach
• Multi-disciplinary team, >10 researchers
• Women and the Periodical Press in China’s Global
Twentieth Century: A Space of Their Own? Ed. by Joan
Judge, Barbara Mittler and Michel Hockx, Cambridge
University Press, 2018.
• Database
Early Chinese Periodicals Online (ECPO)
https://uni-heidelberg.de/ecpo
276 publications: 134 with items
>279,000 scans
40,936 issues: 46,931 articles, 20,532 images, 18,639 ads
Chart: Publication activity by year
Arnold and Hessel | ECPO Database
Opening the data silo
From static export to dynamic data service
• Output data using the Metadata Object Description Schema
(MODS) - Open Access: http://ecpo.uni-hd.de/api/mods/
From static pre-rendered files to dynamic image service
• Implementation of International Image Interoperability
Framework (IIIF) Image API http://iiif.io/technical-details/
From separate names to cross-db agents service
• Identify agent, assign names, link to authorities, structure
information, feed data back to authority files (GND)
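A minimal sketch of what the dynamic image service makes possible: under the IIIF Image API, any region of a scan at any size is requested through a URL of the form {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}. The server address and the identifier below are hypothetical placeholders, not ECPO's actual endpoints, and the size keyword "max" assumes Image API 2.1 or later.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Assemble a IIIF Image API request URL from its path components."""
    # The identifier is percent-encoded, including any slashes it contains.
    encoded_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical server and identifier, for illustration only.
BASE = "https://example.org/iiif/2"
PAGE = "jingbao/1919-04-03/p1"

full_page = iiif_image_url(BASE, PAGE)
# Crop a region (x,y,w,h in pixels) and scale it to 400 pixels wide.
detail = iiif_image_url(BASE, PAGE, region="100,200,800,600", size="400,")

print(full_page)
print(detail)
```

Because the crop and the scaling are expressed in the URL, no pre-rendered derivative files have to be stored for thumbnails or detail views.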
Agents Service
47,245 agents, 163,408 occurrences, 15 ‘languages’
Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
VIAF
GND
Wikidata
VIAF
Baidu baike
Agents with references to authorities:
VIAF: 861
Wikidata: 821
GND: 662
Baidu: 6
DBpedia: 5
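To put these counts in proportion, the short calculation below relates them to the total number of agents reported on the Agents Service slide. This is only a rough sketch: it assumes both figures come from the same snapshot of the database, and because one agent can be linked to several authorities the shares must not be added up.

```python
TOTAL_AGENTS = 47_245  # total agents, from the Agents Service slide

# Agents with a reference to the given authority, as listed on the slide.
linked = {"VIAF": 861, "Wikidata": 821, "GND": 662, "Baidu": 6, "DBpedia": 5}


def share_percent(count, total=TOTAL_AGENTS):
    """Share of all agents, in percent, rounded to two decimals."""
    return round(100 * count / total, 2)


coverage = {name: share_percent(n) for name, n in linked.items()}
for name, pct in coverage.items():
    print(f"{name}: {pct}%")
```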
Opening the Agents Service
Islington Corinthians F.C.:
- Leonard Bradbury
- Jack Braithwaite
- Alec Buchanan
- Pat Clark
- George Dance
- Cyril Longman
- Harry Lowe
- Richard Manning
- Albert (Eddie) Martin
- John Miller
- William Miller
- George Pearce
- Bert Read
- Johnny Sherwood
- Dick Tarrant
- Bill Whittaker
- Ted Wingfield
- J.K. Wright
Source: National Library Board Singapore, NewspaperSG, accessed March 25, 2019,
http://eresources.nlb.gov.sg/newspapers/Digitised/Article/straitstimes19371128-1.2.117
Towards full text
https://uni-heidelberg.de/ecpo
Expanding data: towards full text
• Manual typing not feasible
• Professional double-keying very expensive
• OCR often unusable
• Document: dense layout, normal segmentation fails
• Image: noisy, secondary copies with stains/scratches
• Characters: special characters (emphasis), handwriting
ca. 63% correctly recognized
Segmentation - I
• Page segmentation (pattern recognition/computer vision)
• Analyze layout of page, use page-internal structures
• Identify semantic units
• Generate co-ordinates, relate them to items, store in DB
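The last step, relating generated coordinates to items and storing them, can be sketched with an in-memory SQLite database. The table names, column names, and sample values are assumptions made for illustration and do not describe the actual ECPO schema.

```python
import sqlite3

# Illustrative schema only; table and column names are assumptions.
SCHEMA = """
CREATE TABLE item (
    id INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,      -- e.g. 'article', 'image', 'advertisement'
    title TEXT
);
CREATE TABLE segment (
    id INTEGER PRIMARY KEY,
    item_id INTEGER NOT NULL REFERENCES item(id),
    page TEXT NOT NULL,      -- identifier of the page scan
    x INTEGER NOT NULL, y INTEGER NOT NULL,
    w INTEGER NOT NULL, h INTEGER NOT NULL
);
"""


def store_segment(conn, item_id, page, box):
    """Store one bounding box (x, y, width, height) for an item."""
    x, y, w, h = box
    cur = conn.execute(
        "INSERT INTO segment (item_id, page, x, y, w, h) VALUES (?, ?, ?, ?, ?, ?)",
        (item_id, page, x, y, w, h),
    )
    return cur.lastrowid


def segments_for_item(conn, item_id):
    """Return all boxes that belong to one item, in insertion order."""
    rows = conn.execute(
        "SELECT page, x, y, w, h FROM segment WHERE item_id = ? ORDER BY id",
        (item_id,),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
item_id = conn.execute(
    "INSERT INTO item (kind, title) VALUES (?, ?)", ("article", "sample article")
).lastrowid

# One article may be spread over several boxes on the same page.
store_segment(conn, item_id, "page-0001", (120, 80, 600, 900))
store_segment(conn, item_id, "page-0001", (740, 80, 300, 450))
boxes = segments_for_item(conn, item_id)
print(boxes)
```

Keeping boxes in their own table lets a single item own any number of regions, which matters for dense layouts where articles wrap around images and advertisements.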
Segmentation - II
• Page segmentation (crowdsourcing)
• Pilot project with Pallas Ludens GmbH
• Let the crowd help analyze the pages
• Identify and label four item types:
− image/drawing
− article
− advertisement
− additional information
• Supervised
• Non-Chinese-speaking community!
Processing
2. Page segmentation (computer vision/OCR)
Grouping semantic units
2. Page segmentation (crowdsourcing)
• drawing – correcting – grouping
Outcome of segmentation pilot
1. Page segmentation can be outsourced to expert crowd
• Requires supervision
• Advanced user interfaces (high usability, efficiency)
• Crowd should read Chinese (semantic grouping)
2. Jingbao 晶報 1919-21 completely segmented with qualified
boxes, issues of April 1919 with semantic units
3. Further processing:
• Partnership with Computational Knowledge Lab (知識計算實驗室),
Department of Engineering Science and Ocean Engineering,
National Taiwan University, http://www.cklab.org/
• Seeking additional partners for collaboration!
Chinese Republican Periodicals –
Encoding full text in TEI
Materiality issues
Mark-up: Different character sizes
<tagsDecl>
<rendition scheme="css" selector="body p">font-size: 100%;</rendition>
<rendition xml:id="half">font-size: 50%</rendition>
<rendition xml:id="double">font-size: 200%</rendition>
</tagsDecl>
<hi rendition="#double">女子之於男子</hi>
<hi rendition="#half">試觀西歐各國<lb/>名為男…
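How a consumer of such markup might resolve the rendition pointers back to their CSS declarations is sketched below. The wrapper element `<sample>`, the omitted TEI namespace, and the shortened second `<hi>` are assumptions made to obtain a small well-formed document, since the example on the slide is truncated. The sketch also handles only a single pointer per `rendition` attribute.

```python
import xml.etree.ElementTree as ET

# Expanded name of the xml:id attribute as ElementTree reports it.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Well-formed stand-in for the fragment above (wrapper element is hypothetical).
SAMPLE = """<sample>
  <tagsDecl>
    <rendition xml:id="half">font-size: 50%</rendition>
    <rendition xml:id="double">font-size: 200%</rendition>
  </tagsDecl>
  <p><hi rendition="#double">女子之於男子</hi><hi rendition="#half">試觀西歐各國</hi></p>
</sample>"""


def resolve_renditions(xml_text):
    """Pair the text of each <hi> with the CSS of the rendition it points to."""
    root = ET.fromstring(xml_text)
    styles = {el.get(XML_ID): (el.text or "").strip() for el in root.iter("rendition")}
    resolved = []
    for hi in root.iter("hi"):
        pointer = hi.get("rendition", "")
        resolved.append(("".join(hi.itertext()), styles.get(pointer.lstrip("#"))))
    return resolved


pairs = resolve_renditions(SAMPLE)
for text, css in pairs:
    print(text, "->", css)
```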
Mark-up: Emphasis
In Japanese: “emphasis dots” 圏点 (kenten) or 傍点 (bōten)
1. ◦ U+25E6 “open dot”
2. • U+2022 “filled dot”
3. ○ U+25CB “open circle”
4. ● U+25CF “filled circle”
5. ◎ U+25CE “open double-circle”
6. ◉ U+25C9 “filled double-circle”
7. △ U+25B3 “open triangle”
8. ▲ U+25B2 “filled triangle”
9. ﹆ U+FE46 “open sesame”
10. ﹅ U+FE45 “filled sesame”
https://drafts.csswg.org/css-text-decor-3/#text-emphasis-style-property
BUT: emphasis characters are mixed with punctuation;
differentiating and recording them exactly is a HUGE workload
-> emphasis characters are currently ignored
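One possible reading of "ignored" is to drop the ten code points listed above from a transcription before further processing. The sketch below does exactly that and nothing more; the test string is invented, and the approach is deliberately crude, since it would also delete these characters where they serve as punctuation, which is precisely the difficulty noted above.

```python
# The ten emphasis marks listed above, by code point.
EMPHASIS_MARKS = {
    "\u25E6", "\u2022", "\u25CB", "\u25CF", "\u25CE",
    "\u25C9", "\u25B3", "\u25B2", "\uFE46", "\uFE45",
}


def strip_emphasis(text):
    """Remove emphasis marks and report how many were dropped."""
    kept = [ch for ch in text if ch not in EMPHASIS_MARKS]
    return "".join(kept), len(text) - len(kept)


sample = "女子\u25CF之於\u25CF男子"  # invented test string, not taken from the corpus
clean, removed = strip_emphasis(sample)
print(clean, removed)
```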
Mark-up: Spaces between some characters
<space unit="chars" n="1"/>
OR
<gap unit="char" extent="1"> </gap>
(with “ ” being U+3000)
OR
just use U+3000 without markup
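If the first option is chosen, the conversion from plain U+3000 characters to markup can be automated. The sketch below replaces each run of ideographic spaces with one `<space>` element. It uses the `quantity` attribute, which is how the TEI Guidelines express a count as far as I am aware, rather than the `n` shown on the slide; that substitution is an assumption about the intended encoding. The input is assumed to be plain text, so no XML escaping is performed.

```python
import re

IDEOGRAPHIC_SPACE = "\u3000"


def mark_spaces(text):
    """Replace each run of U+3000 with a single TEI <space> element."""
    def to_element(match):
        return f'<space unit="chars" quantity="{len(match.group(0))}"/>'
    return re.sub(f"{IDEOGRAPHIC_SPACE}+", to_element, text)


marked = mark_spaces("女子\u3000\u3000之於男子")
print(marked)
```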
TEI Example
Wrap-up
From data silo towards open data
• Data collection = research data
• Enhance metadata
• Publishing information, content analysis (keywords)
• Separation of meta-/data from user interface
• FAIR principles
• DOI records for publications (in progress), connect database
to library catalogs
• Publish material and metadata Open Access: images,
publication metadata, and item metadata (article, image, ad)
• Basic data API (MODS XML); open up IIIF manifests and
Agents data (planned)
• Publish metadata on heiDATA/Dataverse (Summer)
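A sketch of how a client could read records delivered by a MODS XML API. The sample record is invented and the actual ECPO response may be structured differently; only the MODS namespace and the element names `titleInfo`, `title`, `originInfo`, and `dateIssued` are taken from the MODS schema itself.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

# Invented sample record; not an actual ECPO API response.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Sample periodical</title></titleInfo>
  <originInfo><dateIssued>1919</dateIssued></originInfo>
</mods>"""


def summarize(mods_xml):
    """Extract title and issue date from one MODS record."""
    root = ET.fromstring(mods_xml)
    return {
        "title": root.findtext("mods:titleInfo/mods:title", namespaces=NS),
        "issued": root.findtext("mods:originInfo/mods:dateIssued", namespaces=NS),
    }


record = summarize(SAMPLE)
print(record)
```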
Wrap-up
• Provide different ways to access data via frontend:
• Search (all metadata and annotations)
• Browse chronologically (calendar)
• Browse/search agents / keywords
• Categories of publications
• Agents service (biographic data)
• cross-db record curation, connect persons with authorities
• plan (2019): add missing agents or names to GND, pull additional
data from authorities, develop agents API
• Page segmentation – crowdsourcing possible, grouping
requires Chinese, new tool creates web-annotations – seeking
partner for automatic page analysis
• Text – plan: process segments, generate full text, store TEI
XML, crowd-based editing
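The web annotations mentioned for page segmentation could look roughly like the following, which follows the W3C Web Annotation Data Model with a media-fragment selector for the box. The canvas URI, the label, and the choice of the "tagging" motivation are assumptions; the output of the actual segmentation tool is not documented in these slides.

```python
import json


def segment_annotation(canvas_uri, box, label):
    """Describe one labelled page region as a W3C Web Annotation."""
    x, y, w, h = box
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "tagging",
        "body": {"type": "TextualBody", "value": label, "purpose": "tagging"},
        "target": {
            "source": canvas_uri,
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"xywh={x},{y},{w},{h}",
            },
        },
    }


# Hypothetical canvas URI and box, for illustration only.
anno = segment_annotation("https://example.org/canvas/p1", (120, 80, 600, 900), "article")
print(json.dumps(anno, ensure_ascii=False, indent=2))
```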
ECPO in a larger context
• Content expansion
• Early western publications printed in China
• Co-operation with Univ. Erlangen: Agents
• ECPO as data platform
• for storing, enhancing, accessing, sharing "grey"
material from the CATS Library
• Outreach/ Communities
• DHd working group Newspapers/Journals, OCR-D,
Transkribus/READ
• Connect with FID Asien (CrossAsia), Non-Latin scripts
interest group, TEI East Asia SIG
• Long-term repository: University Library, heiDATA/heidICON
Contact
Matthias Arnold – Lena Hessel
Heidelberg Centre for Transcultural Studies | HCTS
Karl Jaspers Centre
Voßstr. 2 | Building 4400 | Room 005b
69115 Heidelberg, Germany
Phone: +49 - 6221 - 54 4094
eMail: matthias.arnold@uni-hd.de
Web: http://tinyurl.com/matthias-arnold

Transforming data silos into knowledge: Early Chinese Periodicals Online (ECPO)

  • 1. Transforming data silos into knowledge: Early Chinese Periodicals Online (ECPO) Matthias Arnold, Lena Hessel | Heidelberg | E-Science-Tage 2019 | 2019-03-29
  • 2. Research data – Chinese periodical press • First decades of the 20th century • Understudied, but dominated the contemporary print market and provide access to the "actual culture“ (R. Williams, 1961) • Challenges: • Physically dispersed, often poorly preserved • Voluminous (full runs, daily, up to >30 years) • Multi-generic and intellectually demanding • Approach • Multi-disciplinary team, >10 researchers • Women and the Periodical Press in China’s Global Twentieth Century: A Space of Their Own? Ed. by Joan Judge, Barbara Mittler and Michel Hockx, Cambridge University Press, 2018. • Database Early Chinese Periodicals Online (ECPO)
  • 4.
  • 7. 40.936 issues: 46.931 articles, 20.532 images, 18.639 ads
  • 8.
  • 9.
  • 10. Chart: Publication activity by year Arnold and Hessel | ECPO Database
  • 11. Opening the data silo From static export to dynamic data service • Output data using the Metadata Object Description Schema (MODS) - Open Access: http://ecpo.uni-hd.de/api/mods/ From static pre-rendered files to dynamic image service • Implementation of International Image Interoperability Framework (IIIF) Image API http://iiif.io/technical-details/ From separate names to cross-db agents service • Identify agent, assign names, link to authorities, structure information, feed data back to authority files (GND)
  • 12. Agents Service Arnold and Hessel | ECPO Database
  • 13. 47.245 agents, 163.408 occurrences, 15 ‘languages’
  • 14. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  • 15. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  • 16. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  • 17. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  • 18. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  • 19. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  • 20. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  • 21. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  • 22. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019 VIAF
  • 23. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019 VIAF GND
  • 24. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019 VIAF GND Wikidata VIAF
  • 25. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019 VIAF GND Wikidata VIAF Baidu baike
  • 26. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019 VIAF GND Wikidata VIAF Baidu baike
  • 27. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019 VIAF GND Wikidata VIAF Baidu baike Agents with references to authorities: VIAF: 861 Wikidata: 821 GND: 662 Baidu: 6 DBpedia: 5
  • 28. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  • 29. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  • 30. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  • 31. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  • 32. Opening the Agents Service Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  • 33. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  • 34. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  • 35. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  • 36. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  • 37. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  • 38. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019 Islington Corinthians F.C.: - Leonard Bradbury - Jack Braithwaite - Alec Buchanan - Pat Clark - George Dance - Cyril Longman - Harry Lowe - Richard Manning - Albert (Eddie) Martin - John Miller - William Miller - George Pearce - Bert Read - Johnny Sherwood - Dick Tarrant - Bill Whittaker - Ted Wingfield - J.K. Wright Source: National Library Board Singapore NewspaperSG, accessed March 25, 2019, http://eresources.nlb.gov.sg/new spapers/Digitised/Article/straitsti mes19371128-1.2.117.
  • 39. Opening the Agents Service Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  • 40. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  • 41. Opening the Agents Service Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  • 54. Towards full text
  • 58. Expanding data: towards full text • Manual typing not feasible • Professional double-keying very expensive • OCR often unusable • Document: dense layout, standard segmentation fails • Image: noisy, secondary copies with stains/scratches • Characters: special characters (emphasis), handwriting
  • 64. Segmentation - I • Page segmentation (pattern recognition/computer vision) • Analyze layout of page, use page-internal structures • Identify semantic units • Generate co-ordinates, relate them to items, store in DB
  • 65. Segmentation - II • Page segmentation (crowdsourcing) • Pilot project with Pallas Ludens GmbH • Let the crowd help analyze the pages • Identify and label four item types: − image/drawing − article − advertisement − additional information • Supervised • Non-Chinese speaking community!
  • 66. Processing 2. Page segmentation (computer vision/OCR)
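The two segmentation slides name the steps "generate co-ordinates, relate them to items, store in DB" and list four item types. A minimal sketch of what such a store could look like follows. The table layout, the field names, and the use of SQLite are assumptions made for illustration and are not taken from the ECPO database.

```python
import sqlite3
from dataclasses import astuple, dataclass

# The four item types listed on the "Segmentation - II" slide.
ITEM_TYPES = ("image/drawing", "article", "advertisement", "additional information")


@dataclass(frozen=True)
class Region:
    """One rectangular segment on a scanned page (hypothetical schema)."""
    page_id: str    # hypothetical page identifier
    item_id: int    # hypothetical identifier of the item the box belongs to
    item_type: str  # one of ITEM_TYPES
    x: int          # left edge in pixels
    y: int          # top edge in pixels
    width: int
    height: int


def create_store() -> sqlite3.Connection:
    """Create an in-memory database with a single region table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE region (
               page_id TEXT NOT NULL,
               item_id INTEGER NOT NULL,
               item_type TEXT NOT NULL,
               x INTEGER, y INTEGER, width INTEGER, height INTEGER)"""
    )
    return conn


def add_region(conn: sqlite3.Connection, region: Region) -> None:
    """Store one box, rejecting item types that are not on the slide."""
    if region.item_type not in ITEM_TYPES:
        raise ValueError(f"unknown item type: {region.item_type!r}")
    conn.execute("INSERT INTO region VALUES (?, ?, ?, ?, ?, ?, ?)", astuple(region))


def regions_for_item(conn: sqlite3.Connection, item_id: int) -> list:
    """Return all boxes of one item, top to bottom and left to right."""
    rows = conn.execute(
        "SELECT x, y, width, height FROM region WHERE item_id = ? ORDER BY y, x",
        (item_id,),
    )
    return rows.fetchall()
```

Because several boxes can share one `item_id`, an article that continues in a second column can be stored as two regions and still be retrieved as one semantic unit.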
  • 67. Grouping semantic units 2. Page segmentation (crowdsourcing) • drawing – correcting – grouping
  • 68. Outcome of segmentation pilot 1. Page segmentation can be outsourced to expert crowd • Requires supervision • Advanced user interfaces (high usability, efficiency) • Crowd should read Chinese (semantic grouping) 2. Jingbao 晶報 1919-21 completely segmented with qualified boxes, issues of April 1919 with semantic units 3. Further processing: • Partnership with Computational Knowledge Lab (知識計算實驗室), Department of Engineering Science and Ocean Engineering, National Taiwan University, http://www.cklab.org/ • Seeking additional partners for collaboration!
  • 69. Chinese Republican Periodicals – Encoding full text in TEI
  • 71. Mark-up: Different character sizes <tagsDecl> <rendition scheme="css" selector="body p">font-size: 100%;</rendition> <rendition xml:id="half">font-size: 50%</rendition> <rendition xml:id="double">font-size: 200%</rendition> </tagsDecl> <hi rendition="#double">女子之於男子</hi> <hi rendition="#half">試觀西歐各國<lb/>名為男…
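To illustrate how the `rendition` pointers on `<hi>` resolve to the CSS declared in `<tagsDecl>`, here is a minimal Python sketch. It assumes a simplified, namespace-free fragment wrapped in a made-up `<root>` element. A real TEI file would carry the TEI namespace and a full header, and the second `<hi>` is cut short here because the slide itself truncates it.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Fragment assembled from the slide's markup. <root> and <p> are added only
# to make it well-formed; they are assumptions, not part of the slide.
SAMPLE = """<root>
  <tagsDecl>
    <rendition scheme="css" selector="body p">font-size: 100%;</rendition>
    <rendition xml:id="half">font-size: 50%</rendition>
    <rendition xml:id="double">font-size: 200%</rendition>
  </tagsDecl>
  <p><hi rendition="#double">女子之於男子</hi><hi rendition="#half">試觀西歐各國</hi></p>
</root>"""


def rendition_styles(root):
    """Map each xml:id declared on a <rendition> to its CSS text."""
    return {
        r.get(XML_ID): (r.text or "").strip().rstrip(";")
        for r in root.iter("rendition")
        if r.get(XML_ID)
    }


def styled_spans(root):
    """Yield (text, css) for every <hi> whose rendition pointer resolves."""
    styles = rendition_styles(root)
    for hi in root.iter("hi"):
        pointer = hi.get("rendition", "")
        css = styles.get(pointer.lstrip("#"))
        if css is not None:
            yield (hi.text or ""), css


root = ET.fromstring(SAMPLE)
spans = list(styled_spans(root))
```

The default rendition (the one with a `selector` and no `xml:id`) is deliberately left out of the lookup table, since nothing points to it by identifier.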
  • 72. Mark-up: Emphasis In Japanese: “emphasis dots” 圏点 (kenten) or 傍点 1. ◦ U+25E6 “open dot” 2. • U+2022 “filled dot” 3. ○ U+25CB “open circle” 4. ● U+25CF “filled circle” 5. ◎ U+25CE “open double-circle” 6. ◉ U+25C9 “filled double-circle” 7. △ U+25B3 “open triangle” 8. ▲ U+25B2 “filled triangle” 9. ﹆ U+FE46 “open sesame” 10. ﹅ U+FE45 “filled sesame” https://drafts.csswg.org/css-text-decor-3/#text-emphasis-style-property BUT: emphasis characters are mixed with punctuation; differentiating and recording them exactly is a HUGE workload -> emphasis characters currently ignored
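One naive way to "ignore" the ten emphasis characters in a transcription can be sketched as follows. The function names are made up for illustration, and the approach is an assumption rather than the project's workflow: it removes every occurrence of these code points, so it would also delete a circle or triangle used as ordinary punctuation, which is exactly the ambiguity the slide points out.

```python
from collections import Counter

# The ten code points and labels listed on the slide.
EMPHASIS_MARKS = {
    "\u25E6": "open dot",
    "\u2022": "filled dot",
    "\u25CB": "open circle",
    "\u25CF": "filled circle",
    "\u25CE": "open double-circle",
    "\u25C9": "filled double-circle",
    "\u25B3": "open triangle",
    "\u25B2": "filled triangle",
    "\uFE46": "open sesame",
    "\uFE45": "filled sesame",
}


def strip_emphasis(text: str) -> str:
    """Return the text with every listed emphasis character removed."""
    return "".join(ch for ch in text if ch not in EMPHASIS_MARKS)


def emphasis_report(text: str) -> dict:
    """Count which emphasis marks occur, keyed by the slide's labels."""
    return dict(Counter(EMPHASIS_MARKS[ch] for ch in text if ch in EMPHASIS_MARKS))
```

Running `emphasis_report` before `strip_emphasis` at least records how much emphasis a segment contained, even when the exact positions are dropped.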
  • 73. Mark-up: Spaces between some characters <space unit="chars" n="1"/> OR <gap unit="char" extent="1"> </gap> (with “ ” being U+3000) OR just use U+3000 without markup
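If the transcription keeps plain U+3000 characters (the third option), the first option can still be derived from it later. The sketch below converts each run of ideographic spaces into the slide's `<space unit="chars" n="…"/>` form. Using `n` to record the length of a longer run is an assumption extrapolated from the single-character example, and the function name is hypothetical.

```python
import re
from xml.sax.saxutils import escape

IDEOGRAPHIC_SPACE = "\u3000"


def mark_up_spaces(text: str) -> str:
    """Replace each run of U+3000 with a <space> element recording its length."""
    # The capturing group makes re.split keep the runs of spaces in the result.
    parts = re.split(f"({IDEOGRAPHIC_SPACE}+)", text)
    out = []
    for part in parts:
        if part.startswith(IDEOGRAPHIC_SPACE):
            out.append(f'<space unit="chars" n="{len(part)}"/>')
        else:
            # Escape &, < and > so the result stays well-formed XML content.
            out.append(escape(part))
    return "".join(out)
```

Keeping the raw U+3000 in the source and generating markup on export leaves the decision between the three options reversible.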
  • 75. Wrap-up
  • 76. From data silo towards open data • Data collection = research data • Enhance metadata • Publishing information, content analysis (keywords) • Separation of meta-/data from user interface • FAIR principles • DOI records for publications (in progress), connect database to library catalogs • Publish material and metadata Open Access: images, publication metadata, and item metadata (article, image, ad) • Basic data API (MODS XML); open up IIIF manifests and Agents data (planned) • Publish metadata on heiDATA/Dataverse (Summer)
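As an illustration of what a client of a MODS XML data API might do with a response, the sketch below reads titles and issue dates from a MODS document. The sample record is invented for this example and is not an actual ECPO API response. Only the MODS namespace and the `titleInfo/title` and `originInfo/dateIssued` elements come from the MODS schema itself.

```python
import xml.etree.ElementTree as ET

MODS_URI = "http://www.loc.gov/mods/v3"
MODS_NS = {"mods": MODS_URI}

# Invented sample record for illustration only (not real API output).
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>晶報</title></titleInfo>
  <originInfo><dateIssued>1919</dateIssued></originInfo>
</mods>"""


def summarize_mods(xml_text: str) -> list:
    """Return title and issue date for each MODS record in the document."""
    root = ET.fromstring(xml_text)
    # Accept either a single <mods> record or a <modsCollection> wrapper.
    if root.tag == f"{{{MODS_URI}}}mods":
        records = [root]
    else:
        records = root.findall("mods:mods", MODS_NS)
    return [
        {
            "title": rec.findtext("mods:titleInfo/mods:title", default="", namespaces=MODS_NS),
            "date_issued": rec.findtext("mods:originInfo/mods:dateIssued", default="", namespaces=MODS_NS),
        }
        for rec in records
    ]


summary = summarize_mods(SAMPLE)
```

Because the data is served in a standard schema rather than through the user interface, a short script like this is all a third party needs to reuse the metadata.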
  • 77. Wrap-up • Provide different ways to access data via frontend: • Search (all metadata and annotations) • Browse chronologically (calendar) • Browse/search agents / keywords • Categories of publications • Agents service (biographic data) • cross-db record curation, connect persons with authorities • plan (2019): add missing agents or names to GND, pull additional data from authorities, develop agents API • Page segmentation – crowdsourcing possible, grouping requires Chinese, new tool creates web-annotations – seeking partner for automatic page analysis • Text – plan: process segments, generate full text, store TEI XML, crowd-based editing
  • 78. ECPO in a larger context • Content expansion • Early Western publications printed in China • Co-operation with Univ. Erlangen: Agents • ECPO as data platform • for storing, enhancing, accessing, sharing “grey” material from the CATS Library • Outreach/Communities • DHd working group Newspaper/Journals, OCR-D, Transkribus/READ • Connect with FID Asien (CrossAsia), Non-Latin scripts interest group, TEI East Asia SIG • Long-term repository: University Library, heiDATA/heidICON
  • 79. Contact Matthias Arnold – Lena Hessel Heidelberg Centre for Transcultural Studies | HCTS Karl Jaspers Centre Voßstr. 2 | Building 4400 | Room 005b 69115 Heidelberg, Germany Phone: +49 - 6221 - 54 4094 eMail: matthias.arnold@uni-hd.de Web: http://tinyurl.com/matthias-arnold