This document discusses monitoring and analyzing social TV conversations in real time. It presents an architecture that pulls data from social media streams, processes it through a pipeline and tracker, and analyzes the text with the Textalytics APIs. Textalytics performs language identification, text classification, sentiment analysis, and topic extraction on short social media texts, and links entities to linked open data sources. The analyzed data is organized and stored in SenseiDB, which provides real-time indexing, faceted search, and filtering of social TV data. Lessons learned include SenseiDB's ability to handle the Spanish social TV volume with just a few nodes, and the usefulness of its query language and time operators. Limitations discussed include its single-table model.
4. The plot
1. What's Social TV?
2. Monitoring Social TV conversations. A preliminary architecture
3. Understanding the buzz. Textalytics
4. Organizing the mess. SenseiDB
5. Lessons learned
19. Language identification
● Given a text, identify a list of candidate languages, or just one
● 62 languages
● Uses language n-gram signatures
● Social TV
  ● Filter – TV hashtags often imply the language
  ● Sometimes hashtags are multilingual – but this is not relevant for users
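The idea of n-gram signatures plus a hashtag filter can be sketched as follows. This is a minimal illustration, not the Textalytics implementation: the scoring (share of a text's character trigrams found in each language's signature set) is a simplification chosen for clarity, and the training samples, the `HASHTAG_HINTS` table, and all function names are assumptions made up for this example.

```python
from collections import Counter

def trigrams(text):
    """Character trigrams of the lower-cased, whitespace-normalised text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def signature(sample, top_k=400):
    """Language signature: the set of the top_k most frequent trigrams."""
    return {gram for gram, _ in Counter(trigrams(sample)).most_common(top_k)}

# Toy training samples (assumption): real signatures need far more text.
SAMPLES = {
    "es": "esta noche vemos el programa de cocina en la tele y luego "
          "comentamos con los amigos lo que ha pasado",
    "en": "tonight we watch the cooking show on the telly and then "
          "we talk with our friends about what happened",
}
SIGNATURES = {lang: signature(sample) for lang, sample in SAMPLES.items()}

# Hypothetical table: a TV hashtag that implies the language of the tweet.
HASHTAG_HINTS = {"#topchef": "es"}

def rank_languages(text):
    """Return all known languages, best match first."""
    grams = trigrams(text)
    total = max(len(grams), 1)  # avoid dividing by zero on empty input
    scores = {
        lang: sum(gram in sig for gram in grams) / total
        for lang, sig in SIGNATURES.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

def identify(text):
    """Return a single language, letting a known hashtag decide first."""
    for token in text.lower().split():
        if token in HASHTAG_HINTS:
            return HASHTAG_HINTS[token]
    return rank_languages(text)[0]
```

`rank_languages` corresponds to the "list of languages" mode and `identify` to the "just one" mode; the hashtag shortcut mirrors the Social TV filter.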
21. Text Classification
● Theme labels – IPTC
● Relevance
● Multiple labels
● Tailored for short text (tweets)
● Define your own models and categories
● Social TV – filter on topic content
22. Sentiment analysis
● Document-level classification
● Positive/Negative/Neutral
● Subjective/Objective
● Tailored for short texts
  ● Handles Twitter jargon – RT, @, hashtags, emoticons, spelling errors, disfluencies
● Other features
  ● Entity-level sentiment
  ● Segment-level sentiment
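To make "handles Twitter jargon" concrete, here is a small sketch of a tweet normaliser followed by a lexicon lookup. It is an illustration only: the emoticon table, the tiny lexicon, and the rule of summing word polarities are assumptions for this example and say nothing about how Textalytics actually scores sentiment.

```python
import re

# Hypothetical resources, kept tiny on purpose.
EMOTICONS = {":)": "EMO_POS", ":-)": "EMO_POS", ":D": "EMO_POS",
             ":(": "EMO_NEG", ":-(": "EMO_NEG"}
LEXICON = {"love": 1, "great": 1, "EMO_POS": 1,
           "hate": -1, "awful": -1, "EMO_NEG": -1}

def normalize(tweet):
    """Turn a raw tweet into a list of cleaned tokens."""
    tokens = []
    for tok in tweet.split():
        if tok in EMOTICONS:                 # keep emoticons as polarity tokens
            tokens.append(EMOTICONS[tok])
            continue
        if tok == "RT" or tok.startswith("@") or tok.startswith("http"):
            continue                         # drop retweet marker, mentions, URLs
        tok = tok.lstrip("#").lower()        # "#TopChef" -> "topchef"
        tok = re.sub(r"(.)\1{2,}", r"\1", tok)  # "looove" -> "love"
        tok = tok.strip(".,!?:;\"'")
        if tok:
            tokens.append(tok)
    return tokens

def polarity(tweet):
    """Document-level label: positive, negative or neutral."""
    score = sum(LEXICON.get(tok, 0) for tok in normalize(tweet))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `normalize("RT @user: I looove #TopChef :)")` yields `["i", "love", "topchef", "EMO_POS"]`, which the lexicon then labels as positive.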
23. Topics Extraction
● 12 main types
● Ontology with > 200 types
● Instances – BBVA
● Classes – bank
● Relationships
● Fictional/historic
● Social TV:
  ● Populate custom dictionaries – programs, celebrities, fictional characters
Example of extracted topics (translated from Spanish):
➔ People: Ben Bernanke, Mariano Rajoy…
➔ Locations: London, USA, Paris…
➔ Concepts: risk premium, Prime Minister, parliamentary address, stock index, economic situation…
➔ Economic entities: Ibex35, Dax Xetra…
➔ Companies, organizations: BBVA, Bankia, Goldman Sachs, Coca-Cola, Federal Reserve…
➔ Time references: today, yesterday, around 11 in the morning…
➔ Monetary amounts: 104 dollars, 1 euro…
26. API
● NLP and Semantics API
● Multilingual: EN, ES (FR, IT, PT, CA)
● REST service: JSON and XML
● Combine best of all worlds
  ● Deep language analysis
  ● Comprehensive resources: linguistics and DBs
  ● Ontology
  ● Rule-based methods
  ● Statistical and machine learning methods
27. API
● High-level semantic APIs – close to business scenarios
  ● Media Analysis API, Semantic Publishing API, …
  ● Each with its own configuration and linguistic resources
● Core API – building blocks
  ● Topics, Classification, Sentiment, POS, Linked Data
29. SenseiDB
● Open source, distributed, real-time, semi-structured database
● From LinkedIn SNA: powers the LinkedIn homepage and LinkedIn Signal
● Integrates other open source technologies:
  – Bobo – faceted search
  – Zoie – Lucene-based search engine
  – Apache Kafka – pub-sub system
● http://www.senseidb.com/
30. SenseiDB features
● 'Hybrid' information retrieval – database
● Full-text search
● Structured and faceted search
● Fast real-time updates with low latency and high throughput – pull model
● Single table/collection
● BQL – a SQL-like language
● Eventual consistency
● Distributed – sharding and partitioning
● Hadoop integration
32. Faceted search in depth
● Field types
  ● Basic: string, int, short, long, float, double, char
  ● Complex: date and text (analyzed, term vectors)
● Facet types
  ● Simple: 1 row – 1 value
  ● Hierarchical – path c>b>a
  ● Range – define ranges
  ● Multi: 1 row – n values
  ● Histogram – define bins and their size
  ● TimeRange – for real-time data
  ● Custom
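The facet types listed above differ mainly in how one row contributes to the counts. The following sketch shows that difference for simple, multi, range, and hierarchical facets over plain Python dictionaries. It does not use SenseiDB or Bobo; the field names and function names are invented for the illustration.

```python
from collections import Counter

def simple_facet(rows, field):
    """Simple facet: each row contributes exactly one value."""
    return Counter(row[field] for row in rows)

def multi_facet(rows, field):
    """Multi facet: each row contributes every value in its list."""
    return Counter(value for row in rows for value in row[field])

def range_facet(rows, field, ranges):
    """Range facet: count rows per half-open interval [lo, hi); hi=None is unbounded."""
    counts = {bounds: 0 for bounds in ranges}
    for row in rows:
        for lo, hi in ranges:
            if row[field] >= lo and (hi is None or row[field] < hi):
                counts[(lo, hi)] += 1
    return counts

def path_facet(rows, field, sep=">"):
    """Hierarchical facet: a row counts for every prefix of its path."""
    counts = Counter()
    for row in rows:
        parts = row[field].split(sep)
        for depth in range(1, len(parts) + 1):
            counts[sep.join(parts[:depth])] += 1
    return counts

# Hypothetical tweets with one field per facet type.
tweets = [
    {"lang": "es", "hashtags": ["TopChef", "tv"], "followers": 120, "topic": "tv>cooking"},
    {"lang": "es", "hashtags": ["TopChef"], "followers": 4500, "topic": "tv>cooking"},
    {"lang": "en", "hashtags": [], "followers": 80, "topic": "tv>news"},
]
```

With this data, `multi_facet(tweets, "hashtags")` counts `TopChef` twice, and `path_facet(tweets, "topic")` counts the parent `tv` three times, once for each row beneath it.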
33. Real time indexing
● Data events – add and delete
● Data streams – succession of data events
● Gateways
  ● Read data events from data streams
  ● File
  ● JDBC
  ● JMS
  ● Kafka
  ● Custom: Twitter
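The relationship between data events, data streams, and gateways can be illustrated with a toy in-memory index. This is a sketch under stated assumptions, not SenseiDB code: the event dictionaries, `TinyIndex`, `list_gateway`, and `consume` are all hypothetical names, and a Python generator stands in for a real gateway.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index that is updated one data event at a time."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)

    def apply(self, event):
        doc_id = event["id"]
        # Remove any previous version so add acts as an upsert.
        old = self.docs.pop(doc_id, None)
        if old is not None:
            for term in old["text"].lower().split():
                self.postings[term].discard(doc_id)
        if event["type"] == "add":
            doc = event["doc"]
            self.docs[doc_id] = doc
            for term in doc["text"].lower().split():
                self.postings[term].add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))

def list_gateway(events):
    """Hypothetical gateway: exposes a data stream as an iterator of events."""
    for event in events:
        yield event

def consume(gateway, index):
    """The index pulls events from the gateway; each one is searchable at once."""
    for event in gateway:
        index.apply(event)
```

Because every event is applied as soon as it is pulled, there is no batch commit step between a tweet arriving and it becoming searchable.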
34. BQL – search, filter and facets
● Search – common boolean and phrase operators
● Filters – WHERE conditions
● Facets – support basic analytics tasks defined on facets
● Relevance
  ● Default – recency
  ● Ad hoc – may be defined in the query
35. BQL Query Example on Tweets
SELECT *
WHERE hashtags IN ("TopChef")
BROWSE BY hashtags, user_screen_name, urls
39. Query examples
SELECT *
WHERE QUERY IS "relaxing cup of coffee"
AND time IN LAST 2 hours
BROWSE BY entities, sentiment
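What the query above asks for can be emulated in a few lines of Python: a text match, a time window, and facet counts over the hits. This is only an approximation for intuition. It assumes an all-terms match, which may differ from how Sensei parses `QUERY IS`, and the `run_query` function and the tweet fields are made up for the example.

```python
from collections import Counter
from datetime import datetime, timedelta

def run_query(tweets, terms, now, last=timedelta(hours=2),
              browse_by=("entities", "sentiment")):
    """Return (hits, facet counts) for tweets matching all terms in the window."""
    cutoff = now - last
    hits = []
    for tweet in tweets:
        words = set(tweet["text"].lower().split())
        if tweet["time"] >= cutoff and all(term in words for term in terms):
            hits.append(tweet)
    facets = {field: Counter() for field in browse_by}
    for tweet in hits:
        for field in browse_by:
            value = tweet[field]
            # Multi-valued fields contribute every value, single ones just one.
            facets[field].update(value if isinstance(value, list) else [value])
    return hits, facets

# Hypothetical data for the example.
now = datetime(2013, 9, 10, 12, 0)
tweets = [
    {"text": "A relaxing cup of coffee in Plaza Mayor",
     "time": now - timedelta(minutes=30),
     "entities": ["Plaza Mayor", "Madrid"], "sentiment": "positive"},
    {"text": "that relaxing cup of coffee meme",
     "time": now - timedelta(hours=1),
     "entities": ["Madrid"], "sentiment": "neutral"},
    {"text": "relaxing cup of coffee again",
     "time": now - timedelta(hours=3),
     "entities": [], "sentiment": "neutral"},
    {"text": "just coffee",
     "time": now - timedelta(minutes=10),
     "entities": [], "sentiment": "neutral"},
]
hits, facets = run_query(tweets, ["relaxing", "cup", "of", "coffee"], now)
```

The third tweet falls outside the two-hour window and the fourth lacks most of the terms, so only the first two contribute to the `entities` and `sentiment` counts.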
40. Using facets for semantic search
● Define a facet for:
  ● Entities/concepts → tweets about Chicote – including all variants + user + hashtags
  ● Each entity type → navigate by type – popular people
  ● Classification/sentiment/emotions → positive tweets about Chicote
  ● Users or hashtags → popular users / popular mentions / correlated hashtags
42. Scalability
● ZooKeeper to coordinate replicas
● Low indexing latency (no batch commit)
● Low search latency – even with indexing bursts
● Horizontally scalable – shards
● Shards may be replicated N times
● Elastic – nodes can be added to accommodate growth
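Sharding with replication can be pictured with a toy cluster: each document is routed to one shard, written to every replica of that shard, and a search fans out to one replica per shard before merging. The sketch below is illustrative only. Routing by CRC32 hash is an assumption for the example, not necessarily SenseiDB's sharding strategy, and the coordination that ZooKeeper would provide is left out.

```python
import zlib

class Cluster:
    """Toy cluster: num_shards partitions, each replicated `replicas` times."""

    def __init__(self, num_shards, replicas):
        # shards[s][r] is replica r of shard s; replicas of a shard hold the same docs.
        self.shards = [[{} for _ in range(replicas)] for _ in range(num_shards)]

    def shard_of(self, doc_id):
        # Stable hash so the same id always lands on the same shard.
        return zlib.crc32(str(doc_id).encode()) % len(self.shards)

    def index(self, doc_id, doc):
        # Write to every replica of the owning shard.
        for replica in self.shards[self.shard_of(doc_id)]:
            replica[doc_id] = doc

    def search(self, predicate, replica_choice=0):
        # Scatter: ask one replica of every shard. Gather: merge the partial results.
        results = []
        for replicas in self.shards:
            replica = replicas[replica_choice % len(replicas)]
            results.extend(doc_id for doc_id, doc in replica.items() if predicate(doc))
        return sorted(results)
```

Adding shards spreads the indexing load, while adding replicas spreads the query load, since any replica of a shard can answer for it.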
43. Other features
● Batch indexing via Hadoop – ETL
● Simple analytics by batch indexing
● Customized relevance models
● MapReduce functions over facets
  ● Sum, avg, min, max
  ● DistinctCount
● Activity values – volatile values – likes
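Aggregations such as sum, avg, min, max, and distinct count fit the map/reduce shape because each partition can compute a partial result that merges cleanly with the others. The following sketch shows that shape in plain Python. It is not the SenseiDB MapReduce API: the function names, the row fields, and the two "shards" are assumptions for the illustration.

```python
def map_partition(rows, facet, metric, distinct_field):
    """Map step: partial aggregates per facet value within one partition."""
    partial = {}
    for row in rows:
        value = row[metric]
        agg = partial.setdefault(
            row[facet],
            {"n": 0, "sum": 0, "min": value, "max": value, "distinct": set()},
        )
        agg["n"] += 1
        agg["sum"] += value
        agg["min"] = min(agg["min"], value)
        agg["max"] = max(agg["max"], value)
        agg["distinct"].add(row[distinct_field])
    return partial

def reduce_partials(partials):
    """Reduce step: merge the partial aggregates and finish avg and distinct count."""
    merged = {}
    for partial in partials:
        for key, agg in partial.items():
            if key not in merged:
                merged[key] = {**agg, "distinct": set(agg["distinct"])}
                continue
            m = merged[key]
            m["n"] += agg["n"]
            m["sum"] += agg["sum"]
            m["min"] = min(m["min"], agg["min"])
            m["max"] = max(m["max"], agg["max"])
            m["distinct"] |= agg["distinct"]
    return {
        key: {"sum": m["sum"], "avg": m["sum"] / m["n"], "min": m["min"],
              "max": m["max"], "distinct_count": len(m["distinct"])}
        for key, m in merged.items()
    }

# Hypothetical rows held by two partitions.
shard_a = [{"hashtag": "TopChef", "retweets": 3, "user": "ana"},
           {"hashtag": "TopChef", "retweets": 1, "user": "luis"}]
shard_b = [{"hashtag": "TopChef", "retweets": 8, "user": "ana"},
           {"hashtag": "LaVoz", "retweets": 2, "user": "eva"}]
stats = reduce_partials(
    map_partition(shard, "hashtag", "retweets", "user") for shard in (shard_a, shard_b)
)
```

Note that the average is derived from the merged sum and count rather than by averaging per-partition averages, and the distinct count merges sets so that a user seen on both partitions is counted once.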
45. Conclusions
● SenseiDB is fast at searching and indexing, with no noticeable variance
● A couple of nodes are enough to handle the Spanish Social TV volume
● We love the query language (BQL) and its time operators
● Supports real-time exploration
46. Limitations
● SenseiDB
  ● Single-table model – users and reputation are flattened
  ● Tricks needed to store complex facets
  ● Documentation is still scarce
  ● Manageability
● Social TV Tracker
  ● Group and disambiguate entity mentions across tweets
  ● Relevance is tricky – ad hoc