16. The Good
• 160,000+ entries
• Official, high quality sources
• Rich etymology and historical usage
• Full text search with regular expressions
• Still frequently updated!
17. The Bad
• Results are not bookmarkable
• Requires N clicks to get to a definition
• Rare characters become low-res bitmaps
• Difficult to use on mobile devices
• “Optimized for IE 5.0 and Netscape 4.7+”!?
18. The Sad
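“The Committee warmly welcomes links to the 國語辭典 (Mandarin Dictionary). However,
we currently permit only hyperlinks to the dictionary's home page; no other
form of use has been licensed to outside parties. If you have further
questions or suggestions, please write to us.”
- 教育部國語推行委員會 (National Languages Committee, Ministry of Education), 〈有關授權〉 (“On Licensing”)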
23. g0v hackath1n, 2013.1.27.
• Scrape 2741 idioms as HTML (@TonyQ, @MnO2)
• Scrape 3000 characters as raw HTML (@au)
• Design JSON schema from samples (@pingooo)
• Design SQL schema from samples (@albb0920)
• Parse HTML into JSON & SQLite (@kcwu)
• …and for those 24x24 bitmaps…
35. Web Fonts for Private-Use Area
• Initially based on Hán Nôm font (@YaoWei)
• Subset everything outside Big5 range
• Hand-drawn PUA chars like ⿰亻壯
• Later on, switched to Hanazono 花園明朝 font
• 75,619 + 8,236 glyphs
• From 花園大学国際禅学研究所
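To decide which glyphs the web font has to carry, one first needs the set of characters that fall outside Big5. The sketch below is illustrative only: it treats Python's built-in `big5` codec as an approximation of the "Big5 range" on the slide, and the helper names `in_big5` and `chars_needing_webfont` are hypothetical, not taken from the project.

```python
def in_big5(ch):
    """Return True if a single character can be encoded with Python's big5 codec."""
    try:
        ch.encode("big5")
        return True
    except UnicodeEncodeError:
        return False


def chars_needing_webfont(text):
    """Collect the distinct characters in `text` that Big5 cannot represent."""
    return sorted({ch for ch in text if not in_big5(ch)})


# U+20000 is in CJK Extension B, which Big5 does not cover.
sample = "一\U00020000二\U00020000"
missing = chars_needing_webfont(sample)
print([f"U+{ord(ch):04X}" for ch in missing])
```

The resulting list could then be handed to a font subsetting tool; that step is not shown here.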
41. Worked well, but…
• Freezes IE8, crashes IE7
• Broken on Android 2.x, too
• So let’s pre-segment on server
• Needs a tool to move JS into DB
• …wait, we just got one here
45. Let’s PhoneGap it!
• Freezes Xcode, crashes Eclipse
• Solution: Pack into 1024 .txt files
• Take the first character, mod 1024
• Related words share the same bucket
• Great success!
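The bucketing rule above can be sketched in a few lines. This is a minimal illustration, assuming "take the first character" means the Unicode code point of the headword's first character; the function names and the sample entries are hypothetical.

```python
from collections import defaultdict

NUM_BUCKETS = 1024  # number of .txt files, as stated on the slide


def bucket_of(word):
    """Bucket index: code point of the first character, modulo NUM_BUCKETS."""
    return ord(word[0]) % NUM_BUCKETS


def pack(entries):
    """Group a {headword: definition} dict into {bucket index: {headword: definition}}."""
    buckets = defaultdict(dict)
    for word, definition in entries.items():
        buckets[bucket_of(word)][word] = definition
    return dict(buckets)


# Placeholder definitions; only the headwords matter for bucketing.
sample = {"萌": "...", "萌芽": "...", "萌發": "...", "一": "..."}
packed = pack(sample)
for index, words in sorted(packed.items()):
    print(index, list(words))
```

Because the bucket depends only on the first character, every word starting with 萌 lands in the same file, so looking up one entry also loads its likely neighbours.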
47. User-Driven Development
• Wildcard and part-of-word searching (@esor)
• Two-column layout for tablets (@hlb)
• Toggle between Pinyin and Bopomofo (@matic)
• Volume key on Android resizes fonts (@ivan)
• Top Request: Taiwanese Bân-lâm-gí
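Wildcard and part-of-word searching can be illustrated with Python's standard `fnmatch` module. This is only a sketch: the slides do not say which wildcard syntax the app accepts, so `*` (any run of characters) and `?` (exactly one character) are assumptions, as are the `search` helper and the sample word list.

```python
from fnmatch import fnmatchcase


def search(words, pattern):
    """Return the words matching `pattern`.

    A pattern without wildcards is treated as a part-of-word query,
    i.e. it matches any word that contains it.
    """
    if not any(ch in pattern for ch in "*?"):
        pattern = f"*{pattern}*"
    return [w for w in words if fnmatchcase(w, pattern)]


WORDS = ["萌", "萌芽", "萌發", "故態復萌", "發芽"]

print(search(WORDS, "萌*"))  # words starting with 萌
print(search(WORDS, "*萌"))  # words ending with 萌
print(search(WORDS, "萌?"))  # two-character words starting with 萌
print(search(WORDS, "芽"))   # part-of-word: any word containing 芽
```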
49. Personal Motivation
• My main caretakers were my grandparents
• Grandma from Lo̍k-káng, Taiwan
• Grandpa from Sì-chuān, China
• Raised bilingually as a pre-schooler
• But only Mandarin had a writing system
• Editing her memoir brought back memories
51. Good Parts
• Unified Romanization system (TL)
• Standardized Ideographic characters (RHC)
• Full text search with Mandarin, TL & RHC
• MP3 pronunciations of all entries
• Licensed under CC-BY-ND 3.0
52. Not-so-good Parts
• Entries are in non-bookmarkable <iframe>s
• No equivalent Mandarin field for entries
• Still uses bitmaps for Ext-B+ fonts
• Easy to scrape but hard to parse
• …as discovered by @happyman_eric
61. Data Cleanup, 2013.3.30.
• Convert all .xls to .csv with LibreOffice 4
• 3 stars: Non-Proprietary Format
• Replace PUA characters with mapped Unicode
• Add x-造字.csv and x-華語對照表.csv
• Time to put PgREST to work!
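The PUA replacement step could look roughly like the following. It is a sketch under stated assumptions: the mapping is taken to be a two-column CSV (PUA character, replacement text), which may not match the real layout of x-造字.csv, and the sample pairs are invented for illustration. Only the BMP Private Use Area (U+E000 to U+F8FF) is checked.

```python
import csv
import io


def load_mapping(csv_text):
    """Build a str.translate table from rows of (PUA character, replacement)."""
    rows = csv.reader(io.StringIO(csv_text))
    return {ord(pua): replacement for pua, replacement in rows}


def replace_pua(text, mapping):
    """Swap every mapped PUA character for its Unicode replacement."""
    return text.translate(mapping)


def remaining_pua(text):
    """List BMP Private Use Area characters still present after replacement."""
    return [ch for ch in text if 0xE000 <= ord(ch) <= 0xF8FF]


# Invented sample pairs, not the project's actual mapping.
SAMPLE_CSV = "\ue000,\U00020000\n\ue001,\U00020001\n"

mapping = load_mapping(SAMPLE_CSV)
cleaned = replace_pua("a\ue000b\ue001c\ue002", mapping)
print(remaining_pua(cleaned))  # characters that still lack a mapping
```

Reporting the leftovers rather than dropping them keeps unmapped characters visible, so they can be added to the mapping table later.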
62. PgREST: MongoLab-style API Server
• GET /collections/table_or_view
• q=&c=true&f=&fo=true&s=&sk=&l=
curl -G "$LY/collections/bills" --data-urlencode 'q={"proposal.0":"吳育昇"}'
curl -G "$MOE/collections/entries" --data-urlencode 'q={"部首":"一"}' -d c=1
• PUT /collections/table_or_view
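The `q` parameter carries a JSON document, so it has to be percent-encoded before it goes into a URL. The sketch below builds such a URL with the standard library only. The `collection_url` helper and the `https://example.invalid` base are hypothetical placeholders; only the `/collections/...` path and the `q` and `c` parameters come from the slide.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def collection_url(base, collection, query, count=False):
    """Build a GET URL for /collections/<collection> with a JSON `q` filter."""
    params = {"q": json.dumps(query, ensure_ascii=False, separators=(",", ":"))}
    if count:
        params["c"] = "true"  # ask for a count instead of the matching rows
    return f"{base}/collections/{collection}?{urlencode(params)}"


url = collection_url("https://example.invalid", "entries", {"部首": "一"}, count=True)
print(url)

# Decoding the query string recovers the original JSON filter.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```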
66. Lessons Learned
• Open Data is a beginning, not an end
• Keep conversations with all participants
• Turn detractors into collaborators
• Keep a kind heart
• Assume the best intentions