MoeDict (萌典) and g0v (零時政府)

JSON
{ "title": "萌",
"bopomofo": "ㄇㄥˊ",
"pinyin": "méng",
"definitions": [
{ "type": "名", "def": "草木初生的芽。" },
{ "type": "名", "def": "事物發生的開端或徵兆。" },
{ "type": "名", "def": "人民。" } ] }

PostgreSQL
CREATE TABLE moe ( "entry" JSON );
INSERT INTO moe VALUES ($$
{ "title":"萌", "bopomofo": "ㄇㄥˊ", "pinyin": "méng",
"definitions": [ { "type": "名", "def": "草木初生的芽。" },
{ "type": "名", "def": "事物發生的開端或徵兆。" },
{ "type": "名", "def": "人民。" } ] } $$);
INSERT INTO moe VALUES ('這不是ㄓㄟㄙㄣˇ'); -- type error

PLV8
CREATE EXTENSION plv8;
CREATE FUNCTION get_json_key(obj JSON, key TEXT) returns JSON AS $$
return JSON.stringify( obj[key] );
$$ LANGUAGE plv8;
SELECT get_json_key(entry, 'bopomofo') FROM moe;
-- "ㄇㄥˊ"
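
The body of get_json_key is ordinary JavaScript, so it can be tried outside the database first. A minimal sketch, assuming the JSON column arrives as an already-parsed object; the name getJsonKey and the sample entry are illustrative only:

```javascript
// Plain-JavaScript model of the function body above. Inside PLV8 the same
// expression runs in the database; here `obj` is just a local object.
function getJsonKey(obj, key) {
  // JSON.stringify keeps the result a JSON value (strings keep their quotes)
  return JSON.stringify(obj[key]);
}

// Illustrative entry, trimmed from the JSON slide
const entry = { title: "萌", bopomofo: "ㄇㄥˊ", pinyin: "méng" };

console.log(getJsonKey(entry, "bopomofo"));
```

A missing key yields undefined here, because JSON.stringify(undefined) returns undefined rather than a string.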

plv8x: Operators
SELECT entry ~> '@bopomofo' FROM moe;
-- "ㄇㄥˊ"
SELECT '@bopomofo' <~ entry FROM moe;
-- "ㄇㄥˊ"
SELECT ~> 'new Date';
-- "2013-04-17T12:31:57.523Z"

plv8x: Modules
npm i -g uax11
plv8x -i uax11
plv8x -E 'require "uax11" .toFullwidth "méng"'
# ｍｅ́ ｎｇ
SELECT entry ~> 'require "uax11" .toFullwidth @pinyin' FROM moe;
-- "ｍｅ́ ｎｇ"

plv8x: Functions
plv8x -f 'text fullwidth(text)=uax11:toFullwidth'
plv8x -f 'text PINYIN_UPCASE(json)=:&0.pinyin.toUpperCase!'
SELECT fullwidth('TÂN Pek-tiong');
-- ＴＡ̂ Ｎ　Ｐｅｋ－ｔｉｏｎｇ
SELECT fullwidth( PINYIN_UPCASE(entry) ) FROM moe;
-- ＭＥ́ ＮＧ

Summary
• V8 : JavaScript engine
• PLV8 : Stored procedures in JavaScript
• plv8x : Package manager for PLV8
• Turns Node.js modules into SQL functions
• JSON expressions with ~> and <~
• Code reuse for browser + server + database!

Cutting out the Middleware
• Serve JSON API from SQL
• Shared models & validation code
• Put Business Logic into DB
• Perfect fit for Medium Data™

Hackath[0-3]+n
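• Hackathon 0: "Mobilization for the Suppression of Rebellion" (第零次動員戡亂黑客松)
• Hackathon 1: "Release of Public Land" (第壹次公地放領黑客松)
• Hackathon 1.1: "Land to the Tiller" (第壹之壹次耕者有其田黑客松)
• Hackathon 2: "Nine Major Construction Projects" (第貳次九大建設黑客松)
• Hackathon 3: "Living Rooms as Factories" (第參次客廳工廠黑客松)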

The Revised MoE Dictionary (1994)

The Good
• 160,000+ entries
• Official, high quality sources
• Rich etymology and historical usage
• Full text search with regular expressions
• Still frequently updated!

The Bad
• Results are not bookmarkable
• Requires N clicks to get to a definition
• Rare characters become low-res bitmaps
• Difficult to use on mobile devices
• “Optimized for IE 5.0 and Netscape 4.7+”!?

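The Sad
“The Committee warmly welcomes links to the Mandarin Dictionary.
At present, however, we only permit hyperlinks to the dictionary's
home page; no other form of use has been licensed to the public.
If you have further questions or suggestions, please write to us.”
National Languages Committee, Ministry of Education (教育部國語推行委員會), “About Licensing” (〈有關授權〉)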

…and the Very Crazy
• A page that requires no login will automatically log you out!

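Yeh’s Ping, 2013.1.26.
“So I am answering the call of g0v.tw by building 3du.tw: releasing
Traditional Chinese data such as characters, words, idioms,
definitions, and example sentences through an open text API, with
indexing and search added, so that any individual or company that
wants to build on it can use it.”
Yeh Ping (葉平), “Returning the Written Word to the People” (〈還文於民〉)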

The g0v Hackers'
Mass Site-Scraping Incident

g0v hackath1n, 2013.1.27.
• Scrape 2741 idioms as HTML (@TonyQ, @MnO2)
• Scrape 3000 characters as raw HTML (@au)
• Design JSON schema from samples (@pingooo)
• Design SQL schema from samples (@albb0920)
• Parse HTML into JSON & SQLite (@kcwu)
• …and for those 24x24 bitmaps

Finished in 24 hours!
Thanks to: Favonia, Jun-Yuan Yan, Yao Wei, Yaoting Huang, Poka,
Caasi Huang, Daniel Liang, Grey Lee, Irvin Chen, Gugod, Schee…

Rough Consensus (粗略の共識)
Advancing Code (進擊の程式)

Applications
• XUL Desktop App (@racklin)
• OS X Dictionary (@yllan)
• Windows 8 App (@wenpei)
• iOS Client (@tomjpsun, @jamessa, @pct)
• iOS Offline App (@zonble)

Integrations
• Rails API server (@albb0920)
• AngularJS Client+Server (@viirya)
• Chrome Extension (@tonytonyjan)
• Sublime Text plugin (@zonble)
• WinRT Component (@eriksk)

Fair Use
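“For non-profit educational purposes, under Article 50 of the
Copyright Act: ‘Works publicly released in the name of a central or
local government agency or a public juristic person may, within a
reasonable scope, be reproduced, publicly broadcast, or publicly
transmitted.’ The compilation copyright (if any) arising from the
format conversion and rearrangement here is released by @kcwu under CC0.”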

CC0: Public Domain
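“Apart from the data files mentioned above, for all other files in
this directory the author, Audrey Tang (唐鳳), waives, to the extent
permitted by law, all rights held in the work under copyright law,
including all related and neighboring rights, and dedicates the
work to the public domain.”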

5 Stars of Open Data
1. Open License
2. Structured Data
3. Non-Proprietary Format
4. Each Item has a URI
5. Linking between Items

URI Endpoints
• https://moedict.tw/#文字
• 3 APIs (for non-Unicode characters):
• /raw/文字.json {[8ff0]}
• /uni/文字.json ⿰亻壯
• /pua/文字.json U+F8FF0

Web Fonts for Private-Use Area
• Initially based on Hán Nôm font (@YaoWei)
• Subset everything outside Big5 range
• Hand-drawn PUA chars like ⿰亻壯
• Later on, switched to Hanazono 花園明朝 font
• 75,619 + 8,236 glyphs
• From 花園大学国際禅学研究所

Reaching the Fifth Star
1. Open License
2. Structured Data
3. Non-Proprietary Format
4. Each Item has a URI
5. Linking between Items

Chinese Segmentation
• Therearenowhitespacesbetweenwords
• Lots of heuristic algorithms
• Naive solution: Longest-token match
• Requires a large dictionary
• …wait, we just got one here
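
The naive longest-token match named above can be sketched in a few lines of JavaScript. This is an illustration under assumptions, not the project's code: the dictionary is a tiny hypothetical Set, and maxLen is an assumed upper bound on word length.

```javascript
// Greedy longest-token match. `dictionary` is a hypothetical Set of known
// words; anything not found falls back to a single character.
function segment(text, dictionary, maxLen = 4) {
  const chars = Array.from(text); // iterate by code point, not UTF-16 unit
  const tokens = [];
  let i = 0;
  while (i < chars.length) {
    let matched = chars[i];
    // Try the longest candidate first, shrinking until a word is found
    for (let len = Math.min(maxLen, chars.length - i); len > 1; len--) {
      const candidate = chars.slice(i, i + len).join("");
      if (dictionary.has(candidate)) {
        matched = candidate;
        break;
      }
    }
    tokens.push(matched);
    i += Array.from(matched).length;
  }
  return tokens;
}

const dictionary = new Set(["一乾二淨", "文字", "符號"]);
console.log(segment("文字的符號", dictionary)); // [ '文字', '的', '符號' ]
```

The quality of the result depends entirely on dictionary coverage, which is why a 160,000-entry dictionary makes even this naive approach workable.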

In-browser Implementation
{"4":"一(丁不識|不小心|不扭眾|不拗眾|世之雄|世英名|丘一壑|丘之
貉|串驪珠|之為甚|之謂甚|乾二淨|了心願|了百了|了百當|事無成|五
一十|人之交|介不取|仍舊貫|代宗匠|代宗臣|代巨擘|代楷模|代風流|
代鼎臣|以當十|以貫之|來一往|來二去|依舊式|個勁兒|個子兒|個樣
兒|倡三歎|倡百和|偏之見|傅眾咻|償宿願|元大武|元復始|兵一卒|刀
一割|刀兩斷|刀兩段|分一毫|切從簡|切現成|切眾生|刻千金|力承當|
勇之夫|勞久逸|勞永逸|匡天下|去不返|反常態|口價兒|口兩匙|口咬
定|口咬死|古腦兒|名半職|吐為快|吹一唱|呼再諾|呼百應|呼百諾|命
嗚呼|哄而上|哄而散|哄而起|哄而集|唱一和|唱三歎|唱百和|喫一添|
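
One way a length-keyed table like the excerpt above could drive matching in the browser, sketched under assumptions: the miniature table, the grouping of alternatives by first character, and the function name are all hypothetical, and the real client code may work differently.

```javascript
// Hypothetical miniature of a length-keyed table: each key is a word length,
// each value a regex alternation grouped by first character.
const table = {
  4: "一(乾二淨|勞永逸|刀兩斷)",
  2: "文(字|書)|符(號)",
};

// Compile one sticky regex per length, longest length first.
const matchers = Object.keys(table)
  .map(Number)
  .sort((a, b) => b - a)
  .map((len) => new RegExp(table[len], "uy"));

function segmentWithTable(text) {
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    let token = null;
    for (const re of matchers) {
      re.lastIndex = pos; // sticky flag: the match must start exactly at pos
      const m = re.exec(text);
      if (m) {
        token = m[0];
        break;
      }
    }
    if (token === null) {
      // No word starts here: emit one character (one full code point)
      token = String.fromCodePoint(text.codePointAt(pos));
    }
    tokens.push(token);
    pos += token.length;
  }
  return tokens;
}

console.log(segmentWithTable("一乾二淨的文字"));
```

Grouping by first character keeps the serialized table small, since the shared leading character is written once per group.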

Worked well, but…
• Freezes IE8, crashes IE7
• Broken on Android 2.x, too
• So let’s pre-segment on server
• Needs a tool to move JS into DB
• …wait, we just got one here

/a/文字.json
{"h":[{"b":"ㄨㄣˊ ㄗˋ","d":[{"f":"`人類~`用來~`表示~`觀念~、
`記錄~`語言~`的~`符號~。","s":"`筆墨~,`翰墨~"},{"f":"`文書
~。","q":["`五代史~`平話~．`梁~`史~．`卷~`上~：「`您~`去~`攻
破~`宋~`州~，`為我~`奪取~`張~`節使~`歸~`娘~。`才~`得~，`便~`
發文~`字~`來~`報~`我~。」","`警世通言~．`卷~`十~`三~．`三~`現
身~`包龍圖~`斷~`冤~：「`有~`甚事~`煩惱~？`想~`是~`縣~`裡~`有
~`甚~`文字~`不了~。」"]}],"p":"wén zì"}],"t":"`文~`字~"}

Materialized View: 160k .json files
(@obra++)

Let’s PhoneGap it!
• Freezes Xcode, crashes Eclipse
• Solution: Pack into 1024 .txt files
• Take the first character, mod 1024
• Related words share the same bucket
• Great success!
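
The bucketing rule above is small enough to sketch directly. This assumes "take the first character" means its Unicode code point; the function names and the grouping helper are illustrative:

```javascript
// Assumption: the bucket is the first character's code point modulo 1024.
function bucketOf(word, buckets = 1024) {
  return word.codePointAt(0) % buckets;
}

// Group a word list into buckets, as a packing step might do.
function pack(words, buckets = 1024) {
  const packed = new Map();
  for (const word of words) {
    const key = bucketOf(word, buckets);
    if (!packed.has(key)) packed.set(key, []);
    packed.get(key).push(word);
  }
  return packed;
}

// Words sharing a first character always land in the same bucket.
console.log(bucketOf("文字") === bucketOf("文書")); // true
```

Because related words tend to share their first character, a lookup for one entry loads the file that also holds its neighbours.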

User-Driven Development
• Wildcard and part-of-word searching (@esor)
• Two-column layout for tablets (@hlb)
• Toggle between Pinyin and Bopomofo (@matic)
• Simplified character lookup (@xiaofang)
• Top Request: Taiwanese Bân-lâm-gi

Personal Motivation
• My main caretakers were my grandparents
• Grandma from Lo̍k-káng, Taiwan
• Grandpa from Sì-chuān, China
• Raised bilingually as a pre-schooler
• But only Mandarin had a writing system
• Editing her memoir brought back memories

Taiwan Bân-lâm-gi Common Dictionary
(MoE, 2011)

Good Parts
• Unified Romanization system (TL)
• Standardized Ideographic characters (RHC)
• Full text search with Mandarin, TL & RHC
• MP3 pronunciations of all 20k entries
• Licensed under CC-BY-ND 3.0

Not-so-good Parts
• Entries are in non-bookmarkable <iframe>s
• No equivalent Mandarin field for entries
• Still uses bitmaps for Ext-B+ fonts
• Easy to scrape but hard to parse
• …as discovered by @happyman_eric

Crowd-OCR for 154 glyphs, 2013.3.25.

Finished over lunch!
Thanks to: @happyman, @Irvin, @hit1205, @MissleTW, @YuerLee,
@YuanChao, @clkao, @MGDesigner, @gontera…

Database received, 2013.3.27.
• 詞目總檔.xls 詞目總檔.屬性對照.xls
• 釋義.xls 釋義.詞性對照.xls
• 又音.xls 又音.屬性對照.xls
• 近義詞對應.xls 反義詞對應.xls
• 詞彙方言差.xls 語音方言差.xls
• 例句.xls

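…What about that extra request?
“Hello,
The data import looks largely correct so far. However, the
‘Mandarin search’ on the twblg website can find the Bân-lâm-gi
dictionary entry 「離離」 by searching for 「一乾二淨」, and that
mapping table does not seem to be in the Excel files?”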

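Well…
“Translation between languages cannot be reduced entirely to
word-for-word correspondence. Users without a deep enough
understanding would be misled into thinking that word X in
language A equals word Y in language B (and the public would take
such a presentation as ‘what the Ministry of Education's dictionary says’).”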

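However…
“That is why we keep the Mandarin-equivalent field hidden inside
the system. A dictionary edited by the community would carry less
of this burden. So I really cannot provide it from my side, and I
very much hope you can find a way to solve this.”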

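…it’s all good.
“Understood, and thank you for the reminder and the help.
So far, the Mandarin entries retrieved from the web pages within
the Big5 range yield 26,274 mappings in total.
In our applications we will note that this part is not covered by
the Ministry of Education's CC-BY-ND license.”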

Data Cleanup, 2013.3.30.
• Convert all .xls to .csv with LibreOffice 4
• 3 stars: Non-Proprietary Format
• Replace PUA characters with mapped Unicode
• Add x-造字.csv and x-華語對照表.csv
• Time to put PgREST to work!
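
The PUA replacement step can be sketched as a single pass over the text. The mapping here is illustrative only: its one pair is borrowed from the URI Endpoints examples, while the real table lives in x-造字.csv, whose contents are not shown in these slides.

```javascript
// Illustrative mapping from Private Use Area code points to standard Unicode
// strings. The real mapping would be loaded from a table, not hard-coded.
const puaMap = new Map([[0xf8ff0, "⿰亻壯"]]);

function replacePua(text, mapping) {
  // \p{Co} matches any private-use code point, in the BMP or beyond;
  // the "u" flag makes the regex see whole code points, not surrogate halves.
  return text.replace(/\p{Co}/gu, (ch) => {
    const cp = ch.codePointAt(0);
    return mapping.has(cp) ? mapping.get(cp) : ch; // unmapped PUA stays as-is
  });
}

const sample = "a" + String.fromCodePoint(0xf8ff0) + "b";
console.log(replacePua(sample, puaMap));
```

Leaving unmapped private-use characters untouched makes gaps in the mapping easy to find afterwards with the same \p{Co} pattern.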

PgREST: MongoLab API Server
• GET /collections/table_or_view
• ?q=&c=true&f=&fo=true&s=&sk=&l=
curl -G "$LY/collections/bills" --data-urlencode 'q={"proposal.0":"吳育昇"}'
curl -G "$MOE/collections/entries" --data-urlencode 'q={"部首":"一"}' --data-urlencode 'c=1'
• PUT /collections/table_or_view
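
A request URL for the GET endpoint can also be assembled programmatically. A sketch under assumptions: collectionUrl is a hypothetical helper, the base URL is a local placeholder, and the parameter meanings follow MongoLab's REST conventions (q query, c count, f fields, fo find one, s sort, sk skip, l limit).

```javascript
// Hypothetical helper: builds /collections/<name>?q=...&c=... against a base URL.
function collectionUrl(base, name, options = {}) {
  const url = new URL(`/collections/${encodeURIComponent(name)}`, base);
  for (const [key, value] of Object.entries(options)) {
    // Objects (query, fields, sort) travel as JSON; flags and numbers as text.
    const text = typeof value === "object" ? JSON.stringify(value) : String(value);
    url.searchParams.set(key, text); // searchParams handles percent-encoding
  }
  return url.toString();
}

const MOE = "http://127.0.0.1:3000"; // placeholder for a local PgREST instance
console.log(collectionUrl(MOE, "entries", { q: { "部首": "一" }, c: true }));
```

Letting URLSearchParams do the encoding avoids the shell-quoting problems that raw JSON in a query string tends to cause.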

PgREST: Import/Export
pgrest dbname
export MOE=http://127.0.0.1:3000
curl -i -X PUT -H "Content-Type: text/csv" \
  --data-binary @uni/詞目總檔.csv $MOE/collections/entries
curl $MOE/collections/entries
# [{"主編號":"1","屬性":"1","詞目":"一","音讀":"tsi̍t",
# "文白俗替":"替","部首":"一","部首序":"001-00-01","方言差":""}]

PgREST: 3du.tw JSON in 48 lines
https://github.com/g0v/moedict-data-twblg/blob/master/gen.ls

OSDC.tw, 2013.4.20.
· Bân-lâm-gi with cross-referencing and pronunciations!

g0v prehackath3n, 2013.5.23.
· English, French & German translations! (@a-tsioh)

g0v hackath3n, 2013.6.8.
· Hakka Dictionary with 14k entries! (@a-tsioh, @pcchen)

Lessons Learned
• Open Data is a beginning, not an end
• Keep conversations with all participants
• Turn detractors into collaborators
• Keep a kind heart
• Assume the best intentions

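When is Transparency Useful?
“Change comes only when people gather around a common goal;
technologists can hardly bring it about alone.
Success can be measured by how many people's lives you have
improved, not merely by how many people visit the website you built.”
Aaron Swartz, «Open Government»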

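Thank you!
“A new turning point and glittering stars
now stud the unobstructed sky.
They are the pictographs of five thousand years,
they are the watching eyes of people yet to come.”
Bei Dao (北島), “The Answer” (〈回答〉)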
