1. The S4 project & code
OWNER: Raphael Bonaque
PRESENTER: Juan Álvaro Muñoz Naranjo
OAK Code Days
16-18 October 2014
2. (Very) general overview
— S4 stands for “Social Semantic Structured Search”
— Goal: RDF-based keyword search engine in social and structured
environments (currently Twitter)
— Keywords to be searched are defined by RDF semantics
— Results are ranked by proximity and position of the (extended)
keywords within the documents and their comments
— Example: searching for “animal” should return tweets
containing “cat”, “dog”, “eagle” sorted by the ranking criteria
— Keywords are currently taken from DBPedia
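The keyword extension described above can be sketched as a traversal of a class hierarchy. This is only an illustration: the `SUBCLASSES` table and the `expand_keyword` function are hypothetical names, the hierarchy is toy data (not DBPedia content), and the actual S4 code may extend keywords differently.

```python
from collections import deque

# Hypothetical toy hierarchy mapping a class to its direct subclasses,
# in the spirit of rdfs:subClassOf statements. Not real DBPedia data.
SUBCLASSES = {
    "animal": ["cat", "dog", "bird"],
    "bird": ["eagle"],
}


def expand_keyword(keyword, subclasses=SUBCLASSES):
    """Return the keyword plus every term reachable through subclass links."""
    seen = {keyword}
    queue = deque([keyword])
    while queue:
        term = queue.popleft()
        for child in subclasses.get(term, []):
            if child not in seen:  # guard against cycles and duplicates
                seen.add(child)
                queue.append(child)
    return seen


print(sorted(expand_keyword("animal")))
# ['animal', 'bird', 'cat', 'dog', 'eagle']
```

A tweet containing any term of the extended set would then count as a match for the original keyword.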
3. Programming language
— Python
4. History, versions
— Recent project
— Two branches:
— Storage through serialization
— Storage via PostgreSQL
— No code is reused from or into other projects
5. Code repository
— https://gforge.inria.fr/scm/viewvc.php/?root=xrp
— Folder “postgres4”: version for the PostgreSQL DB.
(permission needed)
Papers
— R. Bonaque, B. Cautis, F. Goasdoué, I. Manolescu. Toward
Social, Structured and Semantic Search. SDSW’14, co-located
with ISWC’14.
— R. Bonaque, B. Cautis, F. Goasdoué, I. Manolescu. S4
Structured Social and Semantic Search (working draft).
6. Overview of the software
— Input:
— User query
— Twitter (static) database
— RDF semantics
— Output:
— A ranked collection of tweets
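The input/output contract above can be illustrated with a toy function that takes a query, a static list of tweets and a table of extended keywords, and returns the matching tweets in ranked order. The function name `rank_tweets`, the `expansions` table and the score (earlier first match ranks higher) are assumptions made for this sketch; they do not reproduce the S4 ranking.

```python
def rank_tweets(query, tweets, expansions):
    """Return the tweets matching the extended query, best score first."""
    terms = expansions.get(query, set()) | {query}
    scored = []
    for tweet in tweets:
        words = [w.strip(".,!?") for w in tweet.lower().split()]
        positions = [i for i, w in enumerate(words) if w in terms]
        if positions:
            # Toy score (an assumption): the earlier the first match, the higher.
            score = 1.0 / (1 + positions[0])
            scored.append((score, tweet))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tweet for _, tweet in scored]


tweets = [
    "My dog is asleep",
    "Eagle spotted over the lake",
    "Nice weather today",
]
expansions = {"animal": {"cat", "dog", "eagle"}}  # hypothetical semantics
print(rank_tweets("animal", tweets, expansions))
# ['Eagle spotted over the lake', 'My dog is asleep']
```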
7. Main modules
1. Tweets retrieval
• Use of Twitter API through the
Tweepy library
• Compresses retrieved data
• receiving.py: tweet retriever through Tweepy
• archiving.py: data compression and management
• secrets.py: API key (not in the repo)
2. Semantics retrieval & storage
• RDF semantics creation & storage
from DBPedia
• rdf_db.py: PostgreSQL I/O wrapper
3. Tweets storage
• Decompresses tweets
• Parses tweets according to RDF semantics
• Stores parsed tweets in DB
• twitter_database.py: (old ver.) object serialization
• social_db.py: (new ver.) PostgreSQL I/O wrapper
• archiving.py
• config.py: database parameters (connection string, etc.)
4. Search engine
• Search algorithm
• algorithm.py: interface for algorithms
• baseline_algorithm.py: actual algorithm and entry
point
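The split between an algorithm interface and a concrete baseline (module 4) could look roughly like the sketch below. The class names, the `search` signature and the substring matching are assumptions for illustration; the real contents of algorithm.py and baseline_algorithm.py are not shown in these slides.

```python
from abc import ABC, abstractmethod


class SearchAlgorithm(ABC):
    """Hypothetical interface: an algorithm turns a query into ranked tweets."""

    @abstractmethod
    def search(self, query, limit=10):
        """Return at most `limit` tweets for `query`, best first."""


class SubstringBaseline(SearchAlgorithm):
    """Stand-in baseline (not the S4 baseline): plain substring matching."""

    def __init__(self, tweets):
        self.tweets = list(tweets)

    def search(self, query, limit=10):
        needle = query.lower()
        hits = [t for t in self.tweets if needle in t.lower()]
        return hits[:limit]


engine = SubstringBaseline(["a cat on the roof", "a dog in the yard"])
print(engine.search("cat"))
# ['a cat on the roof']
```

With this layout, an alternative algorithm only needs to subclass the interface and implement `search`; the calling code stays unchanged.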
9. External software
— Tweepy: Twitter API interface for Python
http://github.com/tweepy/tweepy
— Twitter_nlp: Tweet natural language processing for Python
http://github.com/aritter/twitter_nlp
— Psycopg: PostgreSQL adapter for Python
http://initd.org/psycopg
— SciPy: scientific computing library for Python
http://www.scipy.org
— XZ: data compression tool
http://tukaani.org/xz
— Matplotlib (soon): plotting library for Python
http://matplotlib.org
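As an illustration of the XZ compression step, Python's standard `lzma` module reads and writes the .xz format directly. The sketch below assumes tweets are stored as one JSON object per line; that storage layout and the function names are assumptions, and archiving.py may instead call the xz tool or use another layout.

```python
import json
import lzma


def compress_tweets(tweets):
    """Serialize tweets as JSON lines and compress them in .xz format."""
    payload = "\n".join(json.dumps(t, sort_keys=True) for t in tweets)
    return lzma.compress(payload.encode("utf-8"))


def decompress_tweets(blob):
    """Inverse of compress_tweets: return the list of tweet dictionaries."""
    text = lzma.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


sample = [{"id": 1, "text": "my cat"}, {"id": 2, "text": "an eagle"}]
blob = compress_tweets(sample)
assert decompress_tweets(blob) == sample  # lossless round trip
```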
10. TODO
— Implement execution scripts
— Testing, benchmarking
— Graph drawing
— Optimization: query rewriting
— Use of the “RDF loader into PostgreSQL” project
— Alternatives to the baseline algorithm
Known bugs
— Tweepy crashes randomly, so Raphael had to write a
wrapper to restart it when needed
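A restart wrapper of this kind can be sketched in a few lines. This is not Raphael's wrapper: the function `run_with_restart`, its parameters and the fixed delay are assumptions, and `start_stream` stands for whatever callable launches the tweet retrieval.

```python
import time


def run_with_restart(start_stream, max_restarts=5, delay_seconds=1.0,
                     sleep=time.sleep):
    """Call start_stream(); on an exception, wait and start it again.

    Gives up and re-raises after max_restarts failed restarts.
    """
    restarts = 0
    while True:
        try:
            return start_stream()
        except Exception:
            if restarts >= max_restarts:
                raise  # too many crashes: surface the last error
            restarts += 1
            sleep(delay_seconds)
```

Usage would be along the lines of `run_with_restart(retrieve_tweets)`, where `retrieve_tweets` is a hypothetical function that runs the retrieval loop. Logging each crash before restarting would make the random failures easier to diagnose.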