3. @honishi
hiroyuki onishi
honishi.tumblr.com
since mid 2008
• not a truly seasoned programmer
• writing code just for fun
• consultant for FAST Search Server at Microsoft
4. “honishi” is my secondary identity on the web,
my primary ones are:
10. core concept
• prerequisite
• no api, except login
• poor performance of iPhone 3g
• scraping
• text-based scraping
• not xml-based:
• dom ... fat? slow?
• sax ... complex?
• as fast as possible
• minimize processing
• minimize network traffic
11. main user interface
• lots of webviews...
(figure: 13 UIWebViews; x11 for browsing (1 unhidden, 10 hidden), x2 for reblogging (always hidden))
12. main user interface
(cont’d)
• it’s slow to start rendering all webviews at
one time
• so webviews are gradually warmed up
(debug view)
13. days of fixing app
• initial release ... jun 2009
• released after 4 rejections by Apple
• days of fixing app ... after release
• every little modification on dashboard
affects app’s scraping logic
14. an opinion from
the opinion leader
• scraping should be executed on server side
• when the structure of html changes, there is a need to modify scraping logic
• if the logic is implemented within client application, it usually takes a long time to release the fixed app: submit the build, wait for Apple’s review, being reviewed by Apple...
• it’s also better for cross-platform
application provisioning
16. weakness of
server side scraping
• scalability?
• all connections & accesses in single point
• need to invest for computing resources there
• possibility of ban?
• service provider can easily identify massive
transactions from one location
• once banned, it’s over
• security?
• no oauth provided at that time
• so need to have & use user’s password at server
side
18. restructuring for
fault-tolerance
• splitting the scraping processes into 2 blocks:
• logic for scraping
• metadata for above
• store them in different places:
• logic inside of the app
• metadata outside of the app, s3
• metadata is read from the app at the time of
startup.
19. logic & metadata
logic (process), inside app | metadata, outside app
1. read dashboard | base url?
2. pre-process | target? how?
3. split posts | boundaries for: html header? footer? post?
4. find next link (then back to 1.) | base url? elements for the url?
20. scraping metadata
• simple property list
• almost all rules are written in simple string
or regular expression
• located on amazon s3
• http://s3.amazonaws.com/tumblrgear/parsemeta.plist
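A minimal sketch of reading such a property list at startup, written in Python rather than the app's own language. The sample payload is hypothetical and only borrows two key names from the slides; the real contents of parsemeta.plist are not reproduced, and the download step is left out.

```python
import plistlib

# Hypothetical sample payload; in the app this would be fetched from s3 at startup.
SAMPLE_PLIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <key>pageHeaderSplitter</key>
    <string>&lt;!-- START POSTS --&gt;</string>
    <key>pageFooterSplitter</key>
    <string>&lt;!-- END POSTS --&gt;</string>
</dict>
</plist>
"""

def load_metadata(raw: bytes) -> dict:
    """Parse the scraping rules from a property list payload."""
    return plistlib.loads(raw)

meta = load_metadata(SAMPLE_PLIST)
print(meta["pageHeaderSplitter"])
```

Because the rules are plain strings, a change in the dashboard html can be followed by editing the plist alone, without touching the logic inside the app.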
27. #3. split post
• detect boundaries in the html
• then split them into header, footer and
posts
# key value
1 pageHeaderSplitter <!-- START POSTS -->
2 pageFooterSplitter <!-- END POSTS -->
3 postBeginSplitter <li id="post_
4 postEndSplitter <!-- END POSTS -->
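The splitting step could look roughly like the following Python sketch. It assumes the page is cut at the header and footer markers and that each post starts at postBeginSplitter; postEndSplitter is not used here because the slides do not show how it is applied. The sample html is made up for illustration.

```python
META = {
    "pageHeaderSplitter": "<!-- START POSTS -->",
    "pageFooterSplitter": "<!-- END POSTS -->",
    "postBeginSplitter": '<li id="post_',
}

def split_page(html, meta):
    """Split a dashboard page into (header, posts, footer) using plain string markers."""
    head_mark = meta["pageHeaderSplitter"]
    foot_mark = meta["pageFooterSplitter"]
    begin_mark = meta["postBeginSplitter"]

    header, _, rest = html.partition(head_mark)
    body, _, footer = rest.partition(foot_mark)

    # Keep the markers so that header + posts + footer reproduces the page.
    header += head_mark
    footer = foot_mark + footer

    chunks = body.split(begin_mark)
    # chunks[0] is whatever precedes the first post; it stays with the header.
    posts = [begin_mark + chunk for chunk in chunks[1:]]
    return header + chunks[0], posts, footer

# Hypothetical page, much smaller than a real dashboard.
PAGE = ('<html><ul><!-- START POSTS -->'
        '<li id="post_1">a</li><li id="post_2">b</li>'
        '<!-- END POSTS --></ul></html>')

header, posts, footer = split_page(PAGE, META)
```

Only string search is involved, with no dom or sax parsing, which matches the "text-based scraping" idea. A page without the markers is not handled in this sketch.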
28. #4. find next link
• find next link elements
• assemble the next link using elements
# key value
1 nextLinkUrl http://www.tumblr.com{1}
2 nextLinkElements <a id="next_page_link" href="(.*)">
• then read next page
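A rough Python sketch of this step, assuming that `{1}` in nextLinkUrl stands for the first captured group of nextLinkElements. The helper names and the sample href are hypothetical.

```python
import re

META = {
    "nextLinkUrl": "http://www.tumblr.com{1}",
    "nextLinkElements": '<a id="next_page_link" href="(.*)">',
}

def fill_template(template, captures):
    """Replace {1}, {2}, ... with the corresponding captured strings."""
    return re.sub(r"\{(\d+)\}", lambda m: captures[int(m.group(1)) - 1], template)

def find_next_link(html, meta):
    """Return the absolute url of the next page, or None if no link is found."""
    match = re.search(meta["nextLinkElements"], html)
    if match is None:
        return None
    return fill_template(meta["nextLinkUrl"], [match.group(1)])

# Hypothetical fragment of a dashboard page.
FRAGMENT = '<a id="next_page_link" href="/dashboard/2/12345">next</a>'
next_url = find_next_link(FRAGMENT, META)
```

Returning None when the element is missing gives the caller a simple way to stop paging.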
29. stored posts
(figure: split html (html header, post #1 ... post #10, html footer) -> stored separately (header, posts array of post #1 ... post #n, footer) -> concatenate on demand (header + post + footer))
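The storage idea can be sketched in Python as below: header and footer are kept once, posts accumulate in an array, and a full html document is assembled only when a post is about to be shown. The class and method names are hypothetical.

```python
class StoredDashboard:
    """Keeps header, footer and posts separately; builds html on demand."""

    def __init__(self, header, footer):
        self.header = header
        self.footer = footer
        self.posts = []

    def append_page(self, posts):
        """Add the posts scraped from one more dashboard page."""
        self.posts.extend(posts)

    def render(self, index):
        """Concatenate header + one post + footer for display in a webview."""
        return self.header + self.posts[index] + self.footer

store = StoredDashboard("<html><ul>", "</ul></html>")
store.append_page(['<li id="post_1">a</li>', '<li id="post_2">b</li>'])
store.append_page(['<li id="post_3">c</li>'])
page_for_second_post = store.render(1)
```

Each rendered document then contains a single post, which keeps the amount of html handed to a webview small.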
31. reblog
• detect reblog url of the post
# key value
1 reblogUrl http://www.tumblr.com{1}
2 reblogElements <a href="(/reblog/.*?)">
• get the raw html from the url
32. reblog (cont’d)
• preprocess the html (disable img src etc...)
# key value
1 reblogReplace <(script .*?</script)> ;; <!--$1-->
2 reblogReplace <link ;; <disabled_link
3 reblogReplace <img ;; <disabled_img
• send the html to webview for reblogging
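A Python sketch of how such replace rules might be applied, assuming that ` ;; ` separates the pattern from the replacement and that `$1` refers to the first captured group. Both are readings of the table above, not confirmed details of the app; the sample html is made up.

```python
import re

RULES = [
    "<(script .*?</script)> ;; <!--$1-->",
    "<link ;; <disabled_link",
    "<img ;; <disabled_img",
]

def apply_rules(html, rules):
    """Apply each 'pattern ;; replacement' rule in order."""
    for rule in rules:
        pattern, replacement = rule.split(" ;; ", 1)
        # Translate $n backreferences into Python's \g<n> form.
        replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
        html = re.sub(pattern, replacement, html, flags=re.DOTALL)
    return html

# Hypothetical reblog page fragment.
RAW = ('<link rel="stylesheet" href="a.css">'
       '<script src="x.js"></script>'
       '<img src="a.png">')
cleaned = apply_rules(RAW, RULES)
```

After this pass the webview no longer fetches stylesheets, scripts or images for the hidden reblog form, which fits the goal of minimizing network traffic.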
33. reblog (cont’d)
• do the javascript things
• put the comment into text area, if provided
• push the submit button
# key value
1 reblogAddCommentJS (javascript here ... snip)
2 reblogSubmitJS (javascript here ... snip)
• wait for redirect back to dashboard
# key value
1 reblogRedirectUrl http://www.tumblr.com/dashboard
• done
34. like
• detect like url of the post
# key value
1 likeUrl http://www.tumblr.com/like/{2}?form_key={3}&id={1}
2 likeElements type="hidden" name="id" value="(.*?)"
3 likeElements action="/like/(.*?)"
4 likeElements name="form_key"\s+value="(.*?)"
• do the simple post
• wait for response code 200
• done
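Assembling the like url can be sketched in Python as follows. It assumes the three likeElements patterns are tried in order and that `{1}`, `{2}`, `{3}` refer to their captures in that order, with the whitespace class in the last pattern written as `\s+`. The post html is hypothetical, and the actual http post is left out.

```python
import re

LIKE_URL = "http://www.tumblr.com/like/{2}?form_key={3}&id={1}"
LIKE_ELEMENTS = [
    r'type="hidden" name="id" value="(.*?)"',
    r'action="/like/(.*?)"',
    r'name="form_key"\s+value="(.*?)"',
]

def build_like_url(post_html):
    """Collect one capture per pattern, then fill the url template."""
    captures = []
    for pattern in LIKE_ELEMENTS:
        match = re.search(pattern, post_html)
        if match is None:
            return None  # the dashboard html changed; metadata needs updating
        captures.append(match.group(1))
    return re.sub(r"\{(\d+)\}", lambda m: captures[int(m.group(1)) - 1], LIKE_URL)

# Hypothetical like form inside a post.
POST_HTML = ('<form action="/like/abc123">'
             '<input type="hidden" name="id" value="42">'
             '<input type="hidden" name="form_key"  value="k9">'
             '</form>')
like_url = build_like_url(POST_HTML)
```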
38. recommended
migration path
• for iOS users ... Tumbletail
• for Android users ... Tumblife
39. conclusion
• i currently do not use this app myself, because:
• softbank’s very poor signal everywhere
• reducing number of following accounts, so it’s enough for me to check the dashboard using pc in the bed
48. overview
(figure: dashboard api -> post queue (mutable array) -> display queue (mutable array) -> display (nswindow); labels: polling every 10 sec w/ since_id; polling every 2 sec, 1 post / dequeue; suspend?)
•open post?
•reblog?
•like?
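One possible reading of this overview, sketched in Python: a slow poll fills the post queue with anything newer than since_id, and a faster poll moves one post at a time to the display queue. Timers and the window are left out, and `fetch` is a hypothetical stand-in for the dashboard api call.

```python
from collections import deque

class Pipeline:
    def __init__(self, fetch):
        # fetch: hypothetical callable, since_id -> list of (post_id, text), oldest first
        self.fetch = fetch
        self.since_id = 0
        self.post_queue = deque()
        self.display_queue = deque()

    def poll_api(self):
        """Meant to run every 10 sec: queue posts newer than since_id."""
        for post_id, text in self.fetch(self.since_id):
            self.post_queue.append((post_id, text))
            self.since_id = max(self.since_id, post_id)

    def poll_display(self):
        """Meant to run every 2 sec: dequeue 1 post for display."""
        if self.post_queue:
            self.display_queue.append(self.post_queue.popleft())

# Fake api data for illustration.
POSTS = [(1, "a"), (2, "b"), (3, "c")]
pipeline = Pipeline(lambda since_id: [p for p in POSTS if p[0] > since_id])

pipeline.poll_api()
pipeline.poll_api()      # nothing new, since_id is already 3
pipeline.poll_display()
pipeline.poll_display()
```

Tracking since_id keeps the second api poll from queueing the same posts again.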
49. Growl, forked
• extracting the displaying module from Growl
• extending the display window
(figure: out of box window: x, icon, title, description; extended window: x r o l, avatar, blog name, image area, upper text area, lower text area, title, source)
50. miscellaneous
• oauth & webview
• all cookies are shared in safari & all instances of webview (default behavior)
• so the login sequence to get authorized doesn’t work as expected
• need to override webview delegate to handle cookie container manually
• xauth...?