Social Data and Log Analysis Using MongoDB
1. Social Data and Log Analysis Using MongoDB
2011/03/01(Tue) #mongotokyo
doryokujin
2. Self-Introduction
• doryokujin (Takahiro Inoue), Age: 25
• Education: Keio University
• Master of Mathematics March 2011 ( Maybe... )
• Major: Randomized Algorithms and Probabilistic Analysis
• Company: Geisha Tokyo Entertainment (GTE)
• Data Mining Engineer (only me, part-time)
• Organized Community:
• MongoDB JP, Tokyo Web Mining
3. My Job
• I’m a Fledgling Data Scientist
• Development of analytical systems for social data
• Development of recommendation systems for social data
• My Interest: Big Data Analysis
• How to collect logs scattered across many servers
• How to store and access data
• How to analyze and visualize billions of records
4. Agenda
• My Company’s Analytic Architecture
• How to Handle Access Logs
• How to Handle User Trace Logs
• How to Collaborate with Front Analytic Tools
• My Future Analytic Architecture
5. Agenda
• My Company’s Analytic Architecture
• How to Handle Access Logs
• How to Handle User Trace Logs
• How to Collaborate with Front Analytic Tools
• My Future Analytic Architecture
[Callouts: Hadoop, Mongo Map Reduce / Hadoop, Schema Free / REST Interface, JSON / Capped Collection, Modifier Operation]
Of Course Everything With
7. Social Game (Mobile): Omiseyasan
• Enjoy arranging their own shop (and avatar)
• Communicate with other users by shopping, part-time jobs, ...
• Buy seeds of items to display in their own shop
13. How to Handle Access Logs
[Diagram labels: Pretreatment: Trimming, Validation, Filtering, ... / As a Data Server / Back Up To S3]
14. Access Data Flow
Caution: need MongoDB >= 1.7.4
[Diagram labels: Pretreatment, user_access, 1st Map Reduce, user_pageview, agent_pageview, hourly_pageview, 2nd Map Reduce, daily_pageview, Group by]
15. Hadoop
• Using Hadoop: Pretreatment Raw Records
• [Map / Reduce]
• Read all records
• Split each record by ‘\s’ (whitespace)
• Filter unnecessary records (such as *.swf)
• Check whether each record is valid
• Insert (save) records to MongoDB
※ Write operations do not yet fully utilize all cores
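The pretreatment steps above can be sketched in plain Python. This is only an illustration, not the Hadoop job from the talk: the five-field layout (date, time, userId, url, statusCode) and the name `pretreat` are assumptions, and the MongoDB insert is left out so the sketch stays self-contained.

```python
import re

# Assumption: static assets such as *.swf are the "unnecessary" records.
SKIP_SUFFIXES = (".swf",)


def pretreat(line):
    """Split one raw record on whitespace, filter it, and validate it.

    Returns a dict ready to be saved, or None if the record is dropped.
    The field layout used here is hypothetical.
    """
    fields = re.split(r"\s+", line.strip())
    if len(fields) != 5:
        return None  # malformed record
    date, time, user_id, url, status = fields
    if url.lower().endswith(SKIP_SUFFIXES):
        return None  # filtered as unnecessary
    if not status.isdigit():
        return None  # failed validation
    return {
        "date": date,
        "time": time,
        "userId": user_id,
        "url": url,
        "statusCode": int(status),
    }
```

Each returned dict corresponds to one document that would be saved to the user_access collection.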
18. 1st Map Reduce
• [Aggregation]
• Group by url, date, userId
• Group by url, date, userAgent
• Group by url, date, time
• Group by url, date, statusCode
• Map Reduce operations run in parallel on all shards
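To make the "group by url, date, userId" aggregation concrete, here is a small in-memory emulation of the map and reduce steps in Python. It is a sketch of the idea only: the real job runs JavaScript map/reduce functions inside MongoDB, and the names `map_record`, `reduce_values`, and `map_reduce` are made up for this example.

```python
from collections import defaultdict


def map_record(doc):
    # Emit one (key, value) pair per access record, like emit(key, {count: 1}).
    yield (doc["url"], doc["date"], doc["userId"]), {"count": 1}


def reduce_values(key, values):
    # Sum the partial counts; the output has the same shape as the input
    # values, so the function can be applied again to its own results.
    return {"count": sum(v["count"] for v in values)}


def map_reduce(docs):
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_record(doc):
            grouped[key].append(value)
    return {key: reduce_values(key, values) for key, values in grouped.items()}


records = [
    {"url": "/shop/top", "date": "2011-02-12", "userId": "u1"},
    {"url": "/shop/top", "date": "2011-02-12", "userId": "u1"},
    {"url": "/shop/top", "date": "2011-02-12", "userId": "u2"},
]
user_pageview = map_reduce(records)
```

Changing the key tuple to (url, date, userAgent) or (url, date, statusCode) gives the other groupings listed above.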
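20. # ( MongoDB >= 1.7.4 )
# "out" names the output collection; merge_output=True merges into it
# (keyword names as in the PyMongo release of that time)
result = db.user_access.map_reduce(map,
reduce,
out="user_pageview",
merge_output=True,
full_response=True,
query={"date": date})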
• About the output collection, there are 4 options: (MongoDB >= 1.7.4)
• out : overwrite the collection if it already exists
• merge_output : merge new data into the old output collection
• reduce_output : a reduce operation will be performed on the two values
(the same key in the new result and the old collection) and the result will be
written to the output collection.
• full_response (=False) : If True, return the stats on the operation.
With inline output, no collection will be created, and the whole map-reduce
operation will happen in RAM. The result set must fit within the 8MB/doc
limit (16MB/doc in 1.8?).
21. Map Reduce (>= 1.7.4): out option in JavaScript
• "collectionName" : If you pass a string indicating the name of a collection, then
the output will replace any existing output collection with the same name.
• { merge : "collectionName" } : This option will merge new data into the old
output collection. In other words, if the same key exists in both the result set and
the old collection, the new key will overwrite the old one.
• { reduce : "collectionName" } : If documents exist for a given key in the result
set and in the old collection, then a reduce operation (using the specified reduce
function) will be performed on the two values and the result will be written to
the output collection. If a finalize function was provided, this will be run after
the reduce as well.
• { inline : 1} : With this option, no collection will be created, and the whole map-
reduce operation will happen in RAM. Also, the results of the map-reduce will
be returned within the result object. Note that this option is possible only when
the result set fits within the 8MB limit.
http://www.mongodb.org/display/DOCS/MapReduce
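The difference between the replace, merge, and reduce output modes can be emulated on plain Python dicts. This is a sketch of the semantics described above, not MongoDB code; `apply_out` and its `mode` values are hypothetical names.

```python
def apply_out(old, new, mode, reduce_fn=None):
    """Combine an old output collection with a new result set.

    old and new are dicts mapping key -> value; mode is one of
    "replace", "merge", or "reduce" (names chosen for this sketch).
    """
    if mode == "replace":
        # The new result replaces the whole old collection.
        return dict(new)
    if mode == "merge":
        # Keys present in both: the new value overwrites the old one.
        merged = dict(old)
        merged.update(new)
        return merged
    if mode == "reduce":
        # Keys present in both: reduce the old and new values together.
        result = dict(old)
        for key, value in new.items():
            if key in result:
                result[key] = reduce_fn(key, [result[key], value])
            else:
                result[key] = value
        return result
    raise ValueError("unknown mode: %r" % mode)


old = {"a": 1, "b": 2}
new = {"b": 10, "c": 20}
total = lambda key, values: sum(values)
```

With the reduce mode and a summing reduce function, daily results can be folded into a running total without re-reading old raw records.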
27. Current Map Reduce is Imperfect
• [Single Threads per node]
• Doesn't scale map-reduce across multiple threads
• [Overwrite the Output Collection]
• Overwrite the old collection ( no other options like “merge” or
“reduce” )
# mapreduce code to merge output (MongoDB < 1.7.4)
result = db.user_access.map_reduce(map,
reduce,
full_response=True,
out="temp_collection",
query={"date": date})
# copy each temporary result document into the accumulated collection
for doc in db.temp_collection.find():
    db.user_pageview.save(doc)
28. Useful Reference: Map Reduce
• http://www.mongodb.org/display/DOCS/MapReduce
• A Look At MongoDB 1.8's MapReduce Changes
• Map Reduce and Getting Under the Hood with Commands
• Map/reduce runs in parallel/distributed?
• Map/Reduce parallelism with Master/Slave
• mapReduce locks the whole server
• mapreduce vs find
33. Hadoop
• Using Hadoop: Pretreatment Raw Records
• [Map / Reduce]
• Split each record by ‘\s’ (whitespace)
• Filter unnecessary records
• Check records for dishonest user behavior
• Unify the format so that records can be summed up ( because raw
records are written in free format )
• Sum up records grouped by “userId” and “actionType”
• Insert (save) records to MongoDB
※ Write operations do not yet fully utilize all cores
34. An Example of User Trace Log
UserId ActionType ActionDetail
35. An Example of User Trace Log
-----Change------
ActionLogger a{ChangeP} (Point,1371,1383)
ActionLogger a{ChangeP} (Point,2373,2423)
------Get------
ActionLogger a{GetMaterial} (syouhinnomoto,0,-1)
ActionLogger a{GetMaterial} usesyouhinnomoto
ActionLogger a{GetMaterial} (omotyanomotoPRO,1,6)
※ The value of “actionDetail” must be unified into one format
-----Trade-----
ActionLogger a{Trade} buy 3 itigoke-kis from gree.jp:00000 #
-----Make-----
ActionLogger a{Make} make item kuronekono_n
ActionLogger a{MakeSelect} make item syouhinnomoto
ActionLogger a{MakeSelect} (syouhinnomoto,0,1)
-----PutOn/Off-----
ActionLogger a{PutOff} put off 1 ksuteras
ActionLogger a{PutOn} put 1 burokkus @2500
-----Clear/Clean-----
ActionLogger a{ClearLuckyStar} Clear LuckyItem_1 4 times
-----Gacha-----
ActionLogger a{Gacha} Play gacha with first free play:
ActionLogger a{Gacha} Play gacha:
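One possible way to unify the free-format “actionDetail” values is sketched below in Python. The regular expressions and the unified shape (a name plus numeric values for tuple-style details, raw text otherwise) are assumptions made for this illustration; the talk does not show the actual format used.

```python
import re
from collections import Counter

# Matches "ActionLogger a{ActionType} detail..." as in the examples above.
LINE_RE = re.compile(r"ActionLogger\s+a\{(\w+)\}\s*(.*)")
# Matches tuple-style details such as "(Point,1371,1383)".
TUPLE_RE = re.compile(r"\((\w+),(-?\d+),(-?\d+)\)")


def parse_action(line):
    """Return {"actionType", "actionDetail"} for one log line, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    action_type, detail = m.group(1), m.group(2).strip()
    t = TUPLE_RE.fullmatch(detail)
    if t:
        unified = {"name": t.group(1),
                   "values": [int(t.group(2)), int(t.group(3))]}
    else:
        unified = {"text": detail}  # free text is kept as-is
    return {"actionType": action_type, "actionDetail": unified}


lines = [
    "ActionLogger a{ChangeP} (Point,1371,1383)",
    "ActionLogger a{GetMaterial} (syouhinnomoto,0,-1)",
    "ActionLogger a{Make} make item kuronekono_n",
    "ActionLogger a{GetMaterial} (omotyanomotoPRO,1,6)",
]
parsed = [parse_action(line) for line in lines]
counts = Counter(p["actionType"] for p in parsed)
```

Counting by actionType here stands in for the "sum up by userId and actionType" step; the sample lines carry no userId, so that part of the key is omitted.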
42. Categorize Users
• [Categorize Users]
• by play term
• by total amount of charge
• by registration date
• [Take a Snapshot of Each Category’s Stats per Week]
[Diagram labels: user_trace, user_registration, user_charge, user_savedata, user_pageview, user_category; arrows labeled "Attribution"]
44. Collection: user_category
> var cross = new Cross() // user-defined function
> MCResign = cross.calc("2011-02-12","MC",1)
// each value is the number of users
// rows: Term (days), columns: Charge (yen)
              0(z)     ~¥1k(s)   ~¥10k(m)   ¥100k~(l)   total
~1day(z)      50000    10        5          0           50015
~1week(s)     50000    100       50         3           50153
~1month(m)    100000   200       100        1           100301
~3month(l)    100000   300       50         6           100356
month~(ll)    0        0         0          0           0
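A cross table like the one above can be produced by bucketing each user on two axes and counting the pairs. The Python sketch below is an assumption-laden illustration: the bucket boundaries are read loosely from the table headers, and the field names `term_days` and `charge_yen` are hypothetical; it is not the Cross() function from the slide.

```python
from collections import Counter


def charge_bucket(yen):
    # Assumed boundaries: 0, up to 1k, up to 10k, above.
    if yen <= 0:
        return "z"
    if yen <= 1000:
        return "s"
    if yen <= 10000:
        return "m"
    return "l"


def term_bucket(days):
    # Assumed boundaries: 1 day, 1 week, 1 month, 3 months, longer.
    if days <= 1:
        return "z"
    if days <= 7:
        return "s"
    if days <= 30:
        return "m"
    if days <= 90:
        return "l"
    return "ll"


def cross_table(users):
    """Count users per (term bucket, charge bucket) cell."""
    return Counter(
        (term_bucket(u["term_days"]), charge_bucket(u["charge_yen"]))
        for u in users
    )


users = [
    {"term_days": 1, "charge_yen": 0},
    {"term_days": 5, "charge_yen": 500},
    {"term_days": 5, "charge_yen": 800},
    {"term_days": 120, "charge_yen": 50000},
]
table = cross_table(users)
```

Saving one such table per week gives the per-category snapshots mentioned on the previous slide.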
48. Data Table: jQuery.DataTables
• Want to Share Daily Summary
• Want to See Data from Many Viewpoints
• Want to Implement Easily
• [ Data Table ]
1 Variable length pagination
2 On-the-fly filtering
3 Multi-column sorting with data type detection
4 Smart handling of column widths
5 Scrolling options for table viewport
6 ...
49. Graph: jQuery.HighCharts
• Want to Visualize Data
• Handle Time Series Data Mainly
• Want to Implement Easily
• [ Graph ]
1. Numerous Chart Types
2. Simple Configuration Syntax
3. Multiple Axes
4. Tooltip Labels
5. Zooming
6. ...
50. sleepy.mongoose
• [REST Interface + Mongo]
• Get Data by HTTP GET/POST Request
• sleepy.mongoose
‣ request as “/db_name/collection_name/_command”
‣ made by a 10gen engineer: @kchodorow
‣ Sleepy.Mongoose: A MongoDB REST Interface
51. sleepy.mongoose
// start server
> python httpd.py
…listening for connections on http://localhost:27080
// connect to MongoDB
> curl --data server=localhost:27017 'http://localhost:27080/_connect'
// request example
> http://localhost:27080/playshop/daily_charge/_find?criteria={}&limit=10&batch_size=10
{"ok": 1, "results": [{"_id": "…", "date": … }, {"_id": …}], "id": 0}
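A client can build the same kind of _find request programmatically. The Python sketch below only constructs the URL, following the "/db_name/collection_name/_command" pattern and the criteria, limit, and batch_size parameters shown above; the helper name `build_find_url` is made up for this example, and no request is actually sent.

```python
import json
from urllib.parse import urlencode


def build_find_url(base, db, collection, criteria=None, limit=10, batch_size=10):
    """Build a sleepy.mongoose-style _find URL (no network access)."""
    query = urlencode({
        "criteria": json.dumps(criteria or {}),  # query document as JSON
        "limit": limit,
        "batch_size": batch_size,
    })
    return "%s/%s/%s/_find?%s" % (base, db, collection, query)


url = build_find_url("http://localhost:27080", "playshop", "daily_charge")
```

URL-encoding the criteria matters once the query document contains quotes or spaces, which the hand-written example with an empty {} does not show.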
52. JSON: Mongo <---> Ajax
[Diagram labels: sleepy.mongoose (REST Interface), Get JSON]
• jQuery libraries and MongoDB are compatible
• It is not necessary to write HTML tags (such as <table>)
70. Summary
• Almighty as an Analytic Data Server
• schema-free: social game data changes often
• rich queries: important for analyzing from many points of view
• powerful aggregation: map reduce
• mongo shell: analysis from the mongo shell is speedy and handy
• More...
• Scalability: Replication and Sharding are very easy to use
• Node.js: it enables server-side scripting with Mongo
72. I ♥ MongoDB JP
• continue to be an organizer of MongoDB JP
• continue to propose many use cases of MongoDB
• ex: Social Data, Log Data, Medical Data, ...
• support MongoDB users
• by document translation, user-group, IRC, blog, book,
twitter,...
• boosting services and products using MongoDB
73. Thank you for coming to Mongo Tokyo!!
[Contact me]
twitter: doryokujin
skype: doryokujin
mail: mr.stoicman@gmail.com
blog: http://d.hatena.ne.jp/doryokujin/
MongoDB JP: https://groups.google.com/group/mongodb-jp?hl=ja