This document describes the development of Moedict, an open source Mandarin Chinese dictionary. It began as a crowdsourced effort to digitize an existing government dictionary. Volunteers scraped data, performed OCR on rare characters, and designed schemas. Other volunteers then built apps and integrations. The dictionary strives for open standards by using open licenses and structured JSON data and by assigning URIs to entries. Later improvements included in-browser word segmentation, mobile apps, and a Taiwanese bilingual dictionary. The project shows how open data and crowdsourcing can be combined to develop public resources collaboratively.
25. The Good
• 160,000+ entries
• Official, high quality sources
• Rich etymology and historical usage
• Full text search with regular expressions
• Still frequently updated!
26. The Bad
• Results are not bookmarkable
• Requires N clicks to get to a definition
• Rare characters become low-res bitmaps
• Difficult to use on mobile devices
• “Optimized for IE 5.0 and Netscape 4.7+”!?
32. g0v hackath1n, 2013.1.27.
• Scrape 2741 idioms as HTML (@TonyQ, @MnO2)
• Scrape 3000 characters as raw HTML (@au)
• Design JSON schema from samples (@pingooo)
• Design SQL schema from samples (@albb0920)
• Parse HTML into JSON & SQLite (@kcwu)
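The JSON-to-SQLite step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual schema: the entry shape, the table names, and the column names are all assumptions made for the example.

```python
import json
import sqlite3

# Hypothetical entry shape; field names are illustrative, not the project's schema.
entry = {
    "title": "萌",
    "heteronyms": [
        {"bopomofo": "ㄇㄥˊ", "definitions": [{"type": "名", "def": "(sample definition)"}]}
    ],
}

def store_entries(conn, entries):
    """Flatten nested JSON entries into two relational tables."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries (id INTEGER PRIMARY KEY, title TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS definitions ("
        "entry_id INTEGER REFERENCES entries(id), bopomofo TEXT, type TEXT, def TEXT)"
    )
    for e in entries:
        cur = conn.execute("INSERT INTO entries (title) VALUES (?)", (e["title"],))
        entry_id = cur.lastrowid
        # One row per definition, carrying the pronunciation of its heteronym.
        for h in e["heteronyms"]:
            for d in h["definitions"]:
                conn.execute(
                    "INSERT INTO definitions VALUES (?, ?, ?, ?)",
                    (entry_id, h["bopomofo"], d["type"], d["def"]),
                )
    conn.commit()

conn = sqlite3.connect(":memory:")
store_entries(conn, [entry])
rows = conn.execute(
    "SELECT e.title, d.bopomofo, d.def "
    "FROM entries e JOIN definitions d ON d.entry_id = e.id"
).fetchall()
print(json.dumps(rows, ensure_ascii=False))
```

Keeping the nested JSON as the source of truth and deriving the SQL tables from it means both formats can be regenerated from the same parse.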
• …and for those 24x24 bitmaps
44. Web Fonts for Private-Use Area
• Initially based on Hán Nôm font (@YaoWei)
• Subset everything outside Big5 range
• Hand-drawn PUA chars like ⿰亻壯
• Later on, switched to Hanazono 花園明朝 font
• 75,619 + 8,236 glyphs
• From 花園大学国際禅学研究所
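One way to decide which glyphs belong in such a subset is to collect the characters that Big5 cannot encode. The sketch below uses Python's built-in `big5` codec as a stand-in for "Big5 range"; that choice, and the helper name `outside_big5`, are assumptions for illustration. The slides do not say which tool did the actual subsetting.

```python
def outside_big5(text):
    """Return the distinct characters in `text` that Big5 cannot encode,
    sorted by code point."""
    missing = set()
    for ch in text:
        try:
            ch.encode("big5")
        except UnicodeEncodeError:
            missing.add(ch)
    return sorted(missing)

# Two common characters plus two CJK Extension B code points.
sample = "一丁\U00020000\U0002A6D6"
needed = outside_big5(sample)
print([f"U+{ord(ch):04X}" for ch in needed])
```

The resulting list could then be handed to a font subsetting tool, so that the web font ships only the glyphs that system fonts are unlikely to cover.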
50. Worked well, but…
• Freezes IE8, crashes IE7
• Broken on Android 2.x, too
• So let’s pre-segment on server
• Needs a tool to move JS into DB
• …wait, we just got one here
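The slides do not say which segmentation algorithm is used. As a minimal sketch of what pre-segmenting on the server could look like, here is greedy forward maximum matching against a word list; the function name, the `max_len` cutoff, and the tiny lexicon are all hypothetical.

```python
def segment(text, lexicon, max_len=8):
    """Greedy forward maximum matching: at each position take the longest
    lexicon word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # single-character fallback
        # Try candidate lengths from longest to shortest (down to 2).
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens

lexicon = {"草木", "初生", "生的"}
tokens = segment("草木初生的芽", lexicon)
print(tokens)
```

Running this once per definition on the server and storing the token boundaries would spare old browsers from doing the matching themselves.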
54. Let’s PhoneGap it!
• Freezes Xcode, crashes Eclipse
• Solution: Pack into 1024 .txt files
• Take the first character, mod 1024
• Related words share the same bucket
• Great success!
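The bucketing rule above can be sketched as follows, assuming "take the first character" means the Unicode code point of the entry's first character. The helper names are hypothetical, as is the idea of naming each file after its bucket index.

```python
from collections import defaultdict

BUCKETS = 1024

def bucket_of(word):
    """Bucket index: code point of the word's first character, modulo 1024."""
    return ord(word[0]) % BUCKETS

def pack(words):
    """Group words by bucket; each bucket would become one .txt file."""
    buckets = defaultdict(list)
    for w in words:
        buckets[bucket_of(w)].append(w)
    return buckets

packed = pack(["萌", "萌芽", "萌發", "一"])
for index, words in sorted(packed.items()):
    # Hypothetical file naming: one file per bucket index.
    print(f"{index}.txt", words)
```

Because the bucket depends only on the first character, every word that starts with 萌 lands in the same file, so looking up a headword also loads its related compounds.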
56. User-Driven Development
• Wildcard and part-of-word searching (@esor)
• Two-column layout for tablets (@hlb)
• Toggle between Pinyin and Bopomofo (@matic)
• Simplified character lookup (@xiaofang)
• Top Request: Taiwanese Bân-lâm-gí
58. Personal Motivation
• My main caretakers were my grandparents
• Grandma from Lo̍k-káng, Taiwan
• Grandpa from Sì-chuān, China
• Raised bilingually as a pre-schooler
• But only Mandarin had a writing system
• Editing her memoir brought back memories
60. Good Parts
• Unified Romanization system (TL)
• Standardized Ideographic characters (RHC)
• Full text search with Mandarin, TL & RHC
• MP3 pronunciations of all 20k entries
• Licensed under CC-BY-ND 3.0
61. Not-so-good Parts
• Entries are in non-bookmarkable <iframe>s
• No equivalent Mandarin field for entries
• Still uses bitmaps for Ext-B+ fonts
• Easy to scrape but hard to parse
• …as discovered by @happyman_eric
70. Data Cleanup, 2013.3.30.
• Convert all .xls to .csv with LibreOffice 4
• 3 stars: Non-Proprietary Format
• Replace PUA characters with mapped Unicode
• Add x-造字.csv and x-華語對照表.csv
• Time to put PgREST to work!
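Replacing PUA characters from a mapping table might look like the sketch below. The two-column CSV layout (PUA code point in hex, then the replacement character) and the sample rows are assumptions; the real layout of x-造字.csv is not shown in the slides.

```python
import csv
import io

# Hypothetical mapping rows: PUA code point (hex), replacement character.
MAPPING_CSV = "E000,\U00020000\nE001,\U00020001\n"

def load_mapping(csv_text):
    """Build a translation table {PUA code point: replacement string}."""
    table = {}
    for pua_hex, replacement in csv.reader(io.StringIO(csv_text)):
        table[int(pua_hex, 16)] = replacement
    return table

def replace_pua(text, table):
    """Swap every mapped PUA character; unmapped characters pass through."""
    return text.translate(table)

table = load_mapping(MAPPING_CSV)
cleaned = replace_pua("a\ue000b\ue001c\ue002", table)
print([f"U+{ord(ch):04X}" for ch in cleaned])
```

Unmapped PUA code points (U+E002 in this sample) are left in place, which makes the remaining gaps easy to find later.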
71. PgREST: MongoLab API Server
• GET /collections/table_or_view
• ?q=&c=true&f=&fo=true&s=&sk=&l=
curl -g "$LY"'/collections/bills?q={"proposal.0":"吳育昇"}'
curl -g "$MOE"'/collections/entries?q={"部首":"一"}&c=1'
• PUT /collections/table_or_view
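A client can build these query URLs programmatically instead of escaping them by hand. The sketch below is an illustration only: the helper `collection_url` and the placeholder host are hypothetical, and reading `c` as count, `sk` as skip, and `l` as limit follows the MongoLab convention rather than anything stated in the slides.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

def collection_url(base, name, query=None, count=False, skip=None, limit=None):
    """Build a MongoLab-style GET URL for /collections/<name>."""
    params = {}
    if query is not None:
        # The q parameter carries the query document as compact JSON.
        params["q"] = json.dumps(query, ensure_ascii=False, separators=(",", ":"))
    if count:
        params["c"] = "true"
    if skip is not None:
        params["sk"] = str(skip)
    if limit is not None:
        params["l"] = str(limit)
    qs = urlencode(params)  # percent-encodes braces, quotes and non-ASCII text
    return f"{base}/collections/{name}" + (f"?{qs}" if qs else "")

# Placeholder host; substitute the real server address.
url = collection_url("https://example.invalid", "entries", {"部首": "一"}, count=True)
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded)
```

Percent-encoding the JSON avoids the shell and URL-globbing pitfalls that raw braces and quotes cause on the command line.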
78. Lessons Learned
• Open Data is a beginning, not an end
• Keep conversations with all participants
• Turn detractors into collaborators
• Keep a kind heart
• Assume the best intentions