PgREST: Node.js in the Database

PgREST
Node.js in the Database
Audrey Tang

PgREST is…
• JSON document store
• Running inside PostgreSQL
• Working with existing relational data
• Capable of loading Node.js modules
• Compatible with MongoLab’s REST API
• = LiveScript + PLV8 + plv8x + OneJS

JSON
{ "title": "萌",
"bopomofo": "ㄇㄥˊ",
"pinyin": "méng",
"definitions": [
{ "type": "名", "def": "草木初生的芽。" },
{ "type": "名", "def": "事物發生的開端或徵兆。" },
{ "type": "名", "def": "人民。" } ] }

PostgreSQL
CREATE TABLE moe ( "entry" JSON );
INSERT INTO moe VALUES ($$
{ "title":"萌", "bopomofo": "ㄇㄥˊ", "pinyin": "méng",
"definitions": [ { "type": "名", "def": "草木初生的芽。" },
{ "type": "名", "def": "事物發生的開端或徵兆。" },
{ "type": "名", "def": "人民。" } ] } $$);

INSERT INTO moe VALUES ('這不是ㄓㄟㄙㄣˇ'); -- type error

PLV8
CREATE EXTENSION plv8;
CREATE FUNCTION get_json_key(obj JSON, key TEXT) returns JSON AS $$
return JSON.stringify( obj[key] );
$$ LANGUAGE plv8;

SELECT get_json_key(entry, 'bopomofo') FROM moe;
-- "ㄇㄥˊ"

plv8x: Operators
SELECT entry |> 'this.bopomofo' FROM moe;
-- "ㄇㄥˊ"
SELECT entry ~> '@bopomofo' FROM moe;
-- "ㄇㄥˊ"
SELECT '@bopomofo' <~ entry FROM moe;
-- "ㄇㄥˊ"
SELECT ~> 'new Date';
-- "2013-04-17T12:31:57.523Z"

plv8x: Command Line
npm i -g plv8x
export PLV8XCONN=dbname
plv8x -r script.ls # .js works too
plv8x -E 'plv8.execute("SELECT entry FROM moe").0.entry.definitions'
# [ { type: '名', def: '草木初生的芽。' },
# { type: '名', def: '事物發生的開端或徵兆。' },
# { type: '名', def: '人民。' } ]

plv8x: Modules
npm i -g uax11
plv8x -i uax11
plv8x -E 'require "uax11" .toFullwidth "méng"'
# ｍｅｎｇ

SELECT entry ~> 'require "uax11" .toFullwidth @pinyin' FROM moe;
-- "ｍｅｎｇ"

plv8x: Functions
plv8x -f 'text fullwidth(text)=uax11:toFullwidth'
plv8x -f 'text PINYIN(json)=:&0.pinyin.toUpperCase!'

SELECT fullwidth('Ingy döt Net');
-- Ｉｎｇｙ　ｄｏｔ　Ｎｅｔ
SELECT fullwidth( PINYIN(entry) ) FROM moe;
-- ＭＥＮＧ

Summary
• V8 : JavaScript engine
• PLV8 : Stored procedures in JavaScript
• plv8x : Package manager for PLV8
• Turns NPM modules into SQL functions
• JSON expressions with ~> and <~
• Code reuse for browser + server + database !

Cutting out the Middleware
• Serve JSON API from SQL
• Shared models & validation code
• Put Business Logic into DB
• Perfect fit for Medium Data™

The Revised MoE Dictionary (1994)

The Good
• 160,000+ entries
• Official, high quality sources
• Rich etymology and historical usage
• Full text search with regular expressions
• Still frequently updated!

The Bad
• Results are not bookmarkable
• Requires N clicks to get to a definition
• Rare characters become low-res bitmaps
• Difficult to use on mobile devices
• ”Optimized for IE 5.0 and Netscape 4.7+”!?

The Sad

“
本會非常歡迎各位來連結「國語辭典」，但是
本會目前只開放以超連結 (hyperlink) 的方式與
國語辭典首頁連結，至於其他方式本會並未對
外開放授權。若還有疑問或建議，歡迎來信。
⧸／教育部國語推行委員會〈有關授權〉

.…and the Very Crazy
• 不需登入的網頁，會自動把你登出！

Yeh’s Ping, 2013.1.26.

“
所以我要響應零時政府 g0v.tw 的活動，來做
3du.tw，把字、詞、成語、定義、例句等等正
體中文資料，用開放的文字 API 釋放出來，加
上索引和搜尋的功能，讓任何想加值的個人或
公司都可以使用。
⧸／葉平〈還文於民〉

零時黑客

集體砍站事件

g0v hackath1n, 2013.1.27.
• Scrape 2741 idioms as HTML (@TonyQ, @MnO2)
• Scrape 3000 characters as raw HTML (@au)
• Design JSON schema from samples (@pingooo)
• Design SQL schema from samples (@albb0920)
• Parse HTML into JSON & SQLite (@kcwu)
• …and for those 24x24 bitmaps…

Finished in 24 hours!
Thanks to: Favonia, Jun-Yuan Yan, Yao Wei, Yaoting Huang, Poka,
Caasi Huang, Daniel Liang, Grey Lee, Irvin Chen, Gugod, Schee…

粗略的共識

會動的程式

Applications
• XUL Desktop App (@racklin)
• OS X Dictionary (@yllan)
• Windows 8 App (@wenpei)
• iOS Client (@tomjpsun, @jamessa, @pct)
• iOS Offline App (@zonble)

Integrations
• Rails API server (@albb0920)
• AngularJS Client+Server (@viirya)
• Chrome Extension (@tonytonyjan)
• Sublime Text plugin (@zonble)
• WinRT Component (@eriksk)

Fair Use

“
為非營利之教育目的，依著作權法第 50 條，
「以中央或地方機關或公法人之名義公開發表
之著作，在合理範圍內，得重製、公開播送或
公開傳輸。」此處轉換格式、重新編排的編輯
著作權(如果有的話)由 @kcwu 以 CC0 釋出。

CC0: Public Domain

“
除前述資料檔之外，本目錄下的所有其他檔
案，由作者唐鳳在法律許可的範圍內，拋棄該
著作依著作權法所享有之權利，包括所有相關
與鄰接的法律權利，並宣告將該著作貢獻至公
眾領域。

5 Stars of Open Data
1. ⊙☉ Open License
2. Structured Data
3. Non-Proprietary Format
4. ✧ Each Item has an URI
5. ✩ Linking between Items

URI Endpoints
• https://moedict.tw/#文字
• 3 APIs (for non-Unicode characters):
• /raw/文字.json {[8ff0]}
• /uni/文字.json ⿰亻壯
• /pua/文字.json U+F8FF0

Web Fonts for Private-Use Area
• Initially based on Hán Nôm font (@YaoWei)
• Subset everything outside Big5 range
• Hand-drawn PUA chars like ⿰亻壯
• Later on, switched to Hanazono 花園明朝 font
• 75,619 + 8,236 glyphs
• From 花園大学国際禅学研究所

Reaching the Fifth Star
1. ⊙☉ Open License
2. Structured Data
3. Non-Proprietary Format
4. ✧ Each Item has an URI
5. ✩ Linking between Items

Chinese Segmentation
• Therearenowhitespacesbetweenwords
• Lots of heuristic algorithms
• Naive solution: Longest-token match
• Requires a large dictionary
• …wait, we just got one here

In-browser Implementation
{"4":"一(丁不識|不小心|不扭眾|不拗眾|世之雄|世英名|丘一壑|丘之
貉|串驪珠|之為甚|之謂甚|乾二淨|了心願|了百了|了百當|事無成|五
一十|人之交|介不取|仍舊貫|代宗匠|代宗臣|代巨擘|代楷模|代風流|
代鼎臣|以當十|以貫之|來一往|來二去|依舊式|個勁兒|個子兒|個樣
兒|倡三歎|倡百和|偏之見|傅眾咻|償宿願|元大武|元復始|兵一卒|刀
一割|刀兩斷|刀兩段|分一毫|切從簡|切現成|切眾生|刻千金|力承當|
勇之夫|勞久逸|勞永逸|匡天下|去不返|反常態|口價兒|口兩匙|口咬
定|口咬死|古腦兒|名半職|吐為快|吹一唱|呼再諾|呼百應|呼百諾|命
嗚呼|哄而上|哄而散|哄而起|哄而集|唱一和|唱三歎|唱百和|喫一添|

Worked well, but…
• Freezes IE8, crashes IE7
• Broken on Android 2.x, too
• So let’s pre-segment on server
• Needs a tool to move JS into DB
• …wait, we just got one here

/a/文字.json
{"h":[{"b":"ㄨㄣˊ ㄗˋ","d":[{"f":"`人類~`用來~`表示~`觀念~、
`記錄~`語言~`的~`符號~。","s":"`筆墨~,`翰墨~"},{"f":"`文書
~。","q":["`五代史~`平話~．`梁~`史~．`卷~`上~：「`您~`去~`攻
破~`宋~`州~，`為我~`奪取~`張~`節使~`歸~`娘~。`才~`得~，`便~`
發文~`字~`來~`報~`我~。」","`警世通言~．`卷~`十~`三~．`三~`現
身~`包龍圖~`斷~`冤~：「`有~`甚事~`煩惱~？`想~`是~`縣~`裡~`有
~`甚~`文字~`不了~。」"]}],"p":"wén zì"}],"t":"`文~`字~"}

Materialized View: 160k .json files
(@obra++)

Let’s PhoneGap it!
• Freezes XCode, crashes Eclipse
• Solution: Pack into 1024 .txt files
• Take the first character, mod 1024
• Related words share the same bucket
• Great success!

User-Driven Development
• Wildcard and part-of-word searching (@esor)
• Two-column layout for tablets (@hlb)
• Toggle between Pinyin and Bopomofo (@matic)
• Volume key on Android resizes fonts (@ivan)
• Top Request: Taiwanese Bân-lâm-gi

Personal Motivation
• My main caretakers were my grandparents
• Grandma from Lo̍k-káng, Taiwan
• Grandpa from Sì-chuān, China
• Raised biligually as a pre-schooler
• But only Mandarin had a writing system
• Editing her memoir brought back memories

Taiwan Bân-lâm-gi Common Dictionary
(MoE, 2011)

Good Parts
• Unified Romanization system (TL)
• Standardized Ideographic characters (RHC)
• Full text search with Mandarin, TL & RHC
• MP3 pronounciations of all entries
• Licensed under CC-BY-ND 3.0

Not-so-good Parts
• Entries are in non-bookmarkable <iframe>s
• No equivalent Mandarin field for entries
• Still uses bitmaps for Ext-B+ fonts
• Easy to scrape but hard to parse
• …as discovered by @happyman_eric

Crowd-OCR for 154 glyphs, 2013.3.25.

Finished over lunch!
Thanks to: @happyman, @Irvin, @hit1205, @MissleTW, @YuerLee,
@YuanChao, @clkao, @MGDesigner, @gontera…

Database received, 2013.3.27.
• 詞目總檔.xls 詞目總檔.屬性對照.xls
• 釋義.xls 釋義.詞性對照.xls
• 又音.xls 又音.屬性對照.xls
• 近義詞對應.xls 反義詞對應.xls
• 詞彙方言差.xls 語音方言差.xls
• 例句.xls

.…What about that extra request?

“
您好：
資料匯入目前大致無誤。不過，twblg 網頁上
的「華語檢索」，可以用「一乾二淨」找到閩
語典的「離離」條目，這個對照表似乎沒有在
Excel 檔中看到？

Well…

“
語言之間的對譯，不能盡然以詞彙對應，對不
夠深入了解的使用者來說，會讓他誤以為A語
言的X詞等於B語言的Y詞（並且這種呈現，會
被民眾認知為「教育部的辭典說的」）。

However…

“
因此華語對應這個欄位，我們是藏在系統中。
如果是民間的辭典編輯，會比較沒有這個負
擔，因此我這裡確實不能給，非常希望你們能
有辦法解決。

.…it’s all good.

“
好的，感謝您的提醒和協助。
目前從網頁以 Big5 範圍取出的華語條目，
共有 26274 筆對映。
在應用上，這部份我們會註明不屬於教育部
CC-BY-ND 的授權範圍。

Data Cleanup, 2013.3.30.
• Convert all .xsl to .csv with LibreOffice 4
• 3 stars: Non-Proprietary Format
• Replace PUA characters with mapped Unicode
• Add x-造字.csv and x-華語對照表.csv
• Time to put PgREST to work!

PgREST: MongoLab API Server
• GET /collections/table_or_view
• q=&c=true&f=&fo=true&s=&sk=&l=

curl $LY/collections/bills?q={"proposal.0":"吳育昇"}
curl $MOE/collections/entries?q={"部首":"一"}&c=1

• PUT /collections/table_or_view

PgREST: Import/Export
pgrest dbname
export MOE=http://127.0.0.1:3000
curl -i -X PUT -H "Content-Type: text/csv"
--data-binary @uni/詞目總檔.csv $MOE/collections/entries

curl $MOE/collections/entries
̍
# [{"主編號","1","屬性":"1","詞目":"一","音讀":"tsit",
# "文白俗替":"替","部首":"一","部首序":"001-00-01","方言差":""}]

PgREST: 3du.tw JSON in 48 lines
https://github.com/g0v/moedict-data-twblg/blob/master/gen.ls

“

Lessons Learned
• Open Data is a beginning, not an end
• Keep conversations with all participants
• Turn detractors into collaborators
• Keep a kind heart
• Assume the best intentions

When is Transparency Useful?

“
眾人為了共同目標聚在一起，才能做出改變，
科技人很難獨力完成。
衡量成功的標準，可以是有多少人的生命因你
獲得改善，而不只是有多少人看你架的網站。
— Aaron Swartz, «Open Government»

Thank you!

“ 新的轉機和閃閃星斗，
正在綴滿沒有遮攔的天空。
那是五千年的象形文字，
那是未來人們凝視的眼睛。

⧸／北島〈回答〉

PgREST: Node.js in the Database

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (7)

Similar a PgREST: Node.js in the Database

Similar a PgREST: Node.js in the Database (20)

Más de Audrey Tang

Más de Audrey Tang (20)

Último

Último (20)

PgREST: Node.js in the Database