Real Time Semantic Search for Social TV streams

Textalytics: Meaning-as-a-Service

César de Pablo Sánchez
Daedalus
8/11 2013

Big Data Spain (Madrid)

The plot
1. What's Social TV?
2. Monitoring Social TV conversations. A preliminary architecture
3. Understanding the buzz. Textalytics
4. Organizing the mess. SenseiDB
5. Lessons learned

Social TV
Second Screen

Transmedia

Not just TV
Sports
Elections

Alerts

Big Data?
Volume
Velocity

Variety

Users?
Viewers
Channels

Brands

Viewers?
Keep updated
Participate
Confirm beliefs
Belong to group
Influence
Vote

Channels?
Understand

React

Measure

Brands?
Select programs
Reputation
Find audience

Example from Bluefin Labs

Monitoring Social TV conversations. The architecture

[Architecture diagram. Labels: Pull, gateway, pipeline, tracker, HTTP Stream, EPG]

Understanding the buzz
Textalytics API

Text Classification
Sentiment Analysis
Topics Extraction
Language identification
User Demographics

Core API
Lemmatization, POS and Parsing
Spell, Grammar and Style
Semantic Linked Data Viewer
Speech Recognition and Speaker Diarization

Language identification
● Given a text, identify a list of candidate languages, or just one
● 62 languages
● Uses language n-gram signatures
● Social TV
  ● Filter: a TV hashtag often implies the language
  ● Hashtags are sometimes multilingual, but this is not relevant for users
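
To make the n-gram signature idea concrete, here is a minimal sketch, not the Textalytics implementation. It assumes character trigrams, two toy training samples, and a simple overlap score; the names `signature`, `SAMPLES` and `identify` are hypothetical.

```python
from collections import Counter

def signature(text, n=3, size=300):
    """Return the set of the `size` most frequent character n-grams in `text`."""
    padded = " " + " ".join(text.lower().split()) + " "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return {gram for gram, _ in grams.most_common(size)}

# Toy training samples (assumption); a real system would learn from large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is what "
          "the people are watching tonight on the show",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y esto es lo "
          "que la gente está viendo esta noche en el programa",
}
SIGNATURES = {lang: signature(text) for lang, text in SAMPLES.items()}

def identify(text, k=1):
    """Return the `k` best-matching language codes, best first."""
    grams = signature(text)

    def score(lang):
        # Fraction of the text's n-grams that also occur in the language signature.
        return len(grams & SIGNATURES[lang]) / max(len(grams), 1)

    return sorted(SIGNATURES, key=score, reverse=True)[:k]
```

Calling `identify(tweet)` gives the single best guess, while `identify(tweet, k=2)` returns a ranked list, which mirrors the "a language list, or just one" option on the slide.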

Text Classification
● Theme labels – IPTC
  ● Relevance
  ● Multiple labels
● Tailored for short text (tweets)
● Define your own models and categories
● Social TV – filter on topic content

Sentiment analysis
● Document level classification
  ● Positive/Negative/Neutral
  ● Subjective/Objective
● Tailored for short texts
  ● Handles Twitter jargon – RT, @, hashtags, emoticons, spelling errors, disfluencies
● Other features
  ● Entity level sentiment
  ● Segment level sentiment
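
Handling Twitter jargon usually starts with a normalization pass before any sentiment model runs. The sketch below is one possible way to do it, assuming a tiny hypothetical emoticon lexicon (`EMOTICONS`) and a hypothetical `normalize_tweet` helper; it does not describe how Textalytics does it.

```python
import re

# Hypothetical emoticon lexicon; a real one would be much larger.
EMOTICONS = {":)": "positive", ":-)": "positive", ":D": "positive",
             ":(": "negative", ":-(": "negative"}

def normalize_tweet(text):
    """Separate Twitter-specific signals from the plain text of a tweet."""
    is_retweet = re.match(r"\s*RT\b", text) is not None
    mentions = re.findall(r"@(\w+)", text)
    hashtags = re.findall(r"#(\w+)", text)
    emoticons = [polarity for emo, polarity in EMOTICONS.items() if emo in text]

    clean = re.sub(r"^\s*RT\b:?", "", text)        # drop the retweet marker
    clean = re.sub(r"https?://\S+", "", clean)     # drop URLs
    clean = re.sub(r"@\w+:?", "", clean)           # drop user mentions
    clean = clean.replace("#", "")                 # keep hashtag words as text
    for emo in EMOTICONS:
        clean = clean.replace(emo, "")
    clean = re.sub(r"(\w)\1{2,}", r"\1\1", clean)  # squeeze letters repeated 3+ times
    return {
        "text": " ".join(clean.split()),
        "is_retweet": is_retweet,
        "mentions": mentions,
        "hashtags": hashtags,
        "emoticons": emoticons,
    }
```

The emoticon polarities can then be passed to the classifier as extra features, while the cleaned text goes through the regular short-text model.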

Topics Extraction
● 12 main types
● Ontology with > 200 types
  ● Instances – BBVA
  ● Classes – bank
  ● Relationship
  ● Fictional/historic
● SocialTV:
  ● Populate custom dictionaries – programs, celebrities, fictional characters

Example of extracted topics:
➔ People: Ben Bernanke, Mariano Rajoy…
➔ Companies, organizations: BBVA, Bankia, Goldman Sachs, Coca-Cola, Federal Reserve…
➔ Locations: London, USA, Paris…
➔ Concepts: risk premium, Prime Minister, parliamentary address, stock index, economic situation…
➔ Economic entities: Ibex35, Dax Xetra…
➔ Time references: today, yesterday, around 11 in the morning…
➔ Monetary amounts: 104 dollars, 1 euro…
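
For the Social TV case, custom dictionaries of programs, celebrities and fictional characters can be applied with a simple longest-match lookup. The following is a minimal sketch under that assumption; the dictionary entries, the handle `@albertochicote` and the function `extract_topics` are illustrative, not part of the Textalytics API.

```python
import re

# Hypothetical custom dictionary: surface form -> (canonical entity, type).
DICTIONARY = {
    "chicote": ("Alberto Chicote", "Person"),
    "alberto chicote": ("Alberto Chicote", "Person"),
    "@albertochicote": ("Alberto Chicote", "Person"),
    "#topchef": ("Top Chef", "Program"),
    "top chef": ("Top Chef", "Program"),
}

# Longer forms first, so "alberto chicote" wins over "chicote".
_FORMS = sorted(DICTIONARY, key=len, reverse=True)
_PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(form) for form in _FORMS) + r")(?!\w)"
)

def extract_topics(text):
    """Return the distinct dictionary entities mentioned in `text`, in order."""
    found = []
    for match in _PATTERN.finditer(text.lower()):
        entity = DICTIONARY[match.group(0)]
        if entity not in found:
            found.append(entity)
    return found
```

Because every variant maps to the same canonical entity, a plain name, a user handle and a hashtag all count as mentions of the same topic.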

Entity Linking
● Linking entities to their 'real' representation
● Linking to several Linked Open Data (LOD) sources
API
● NLP and Semantics API
● Multilingual: EN, ES (FR, IT, PT, CA)
● REST service: JSON and XML
● Combine the best of all worlds
  ● Deep language analysis
  ● Comprehensive resources: linguistic resources and databases
  ● Ontology
  ● Rule-based methods
  ● Statistical and machine learning methods

API
● High-level semantic APIs – close to business scenarios
  ● Media Analysis API – configuration and linguistic resources
  ● Semantic Publishing API – configuration and linguistic resources
  ● … – configuration and linguistic resources
● Core API – building blocks
  ● Topics
  ● Classification
  ● Sentiment
  ● POS
  ● Linked Data

SenseiDB
● Open source, distributed, real-time, semi-structured database
● From LinkedIn SNA: powers the LinkedIn home page and LinkedIn Signal
● Integrates other open source technologies:
  – Bobo – faceted search
  – Zoie – Lucene-based search engine
  – Apache Kafka – pub-sub system
● http://www.senseidb.com/

SenseiDB features
● 'Hybrid' Information Retrieval – Database
  ● Full text search
  ● Structured and faceted search
● Fast real-time updates with low latency and high throughput – pull model
● Single table/collection
● BQL – a SQL-like language
● Eventual consistency
● Distributed – sharding and partitioning
● Hadoop integration

Faceted search
● Amazon.com?
● Identify relevant attributes to use as filters
● Predefined facets
  ● Define a table schema
  ● Define fields as facets – facet schema
  ● Efficient – in memory
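
A toy illustration of predefined, in-memory facets: declare which fields are facets up front, keep one posting set per facet value, and answer filters and counts with set operations. The `FacetIndex` class and its method names are hypothetical; SenseiDB's own facet handling (via Bobo) is far more elaborate.

```python
from collections import defaultdict

class FacetIndex:
    """Single-table index whose facet fields are declared up front."""

    def __init__(self, facet_fields):
        self.facet_fields = list(facet_fields)
        self.docs = []
        # facet field -> facet value -> set of document ids
        self.postings = {f: defaultdict(set) for f in self.facet_fields}

    def add(self, doc):
        doc_id = len(self.docs)
        self.docs.append(doc)
        for field in self.facet_fields:
            value = doc.get(field)
            if value is not None:
                self.postings[field][value].add(doc_id)
        return doc_id

    def select(self, **filters):
        """Return ids of documents matching every facet filter."""
        ids = set(range(len(self.docs)))
        for field, value in filters.items():
            ids &= self.postings[field].get(value, set())
        return ids

    def browse(self, field, ids=None):
        """Count documents per facet value, optionally within a result set."""
        if ids is None:
            ids = set(range(len(self.docs)))
        counts = {}
        for value, posting in self.postings[field].items():
            hits = len(posting & ids)
            if hits:
                counts[value] = hits
        return counts
```

Keeping the postings in memory is what makes the counts cheap: browsing a facet is a set intersection per value rather than a scan over the documents.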

Faceted search in depth
● Field types
  ● Basic: string, int, short, long, float, double, char
  ● Complex: date and text (analyzed, term vectors)
● Facet types
  ● Simple: 1 row – 1 value
  ● Multi: 1 row – n values
  ● Hierarchical – Path c>b>a
  ● Range – define ranges
  ● Histogram – define bins and their size
  ● TimeRange – for real-time data
  ● Custom
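
The difference between the facet types is easiest to see in how each one counts rows. The functions below are a simplified sketch with hypothetical names and sample fields; they only mimic the counting behaviour of simple, multi, range and path facets.

```python
from collections import Counter

def simple_facet(rows, field):
    """Simple facet: each row contributes exactly one value."""
    return Counter(row[field] for row in rows)

def multi_facet(rows, field):
    """Multi facet: each row contributes every value in its list."""
    return Counter(value for row in rows for value in row[field])

def range_facet(rows, field, ranges):
    """Range facet: count rows whose value falls in each [low, high) range."""
    counts = Counter()
    for row in rows:
        for low, high in ranges:
            if low <= row[field] < high:
                counts[(low, high)] += 1
    return counts

def path_facet(rows, field, sep=">"):
    """Path facet: a row counts for its full path and for every ancestor."""
    counts = Counter()
    for row in rows:
        parts = row[field].split(sep)
        for depth in range(1, len(parts) + 1):
            counts[sep.join(parts[:depth])] += 1
    return counts
```

For tweets, `hashtags` is a natural multi facet, follower counts suit a range facet, and a topic taxonomy maps onto a path facet.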

Real time indexing
● Data events – add and delete
● Data streams – succession of data events
● Gateways
  ● Read data events from data streams
  ● File
  ● JDBC
  ● JMS
  ● Kafka
  ● Custom: Twitter
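
The event, stream and gateway vocabulary can be sketched in a few lines. This is an illustrative model only: `DataEvent`, `list_gateway` and `Index` are hypothetical names, and a real gateway would read from a file, JDBC, JMS, Kafka or the Twitter stream instead of a Python list.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator

@dataclass
class DataEvent:
    op: str                      # "add" or "delete"
    uid: int
    doc: dict = field(default_factory=dict)

def list_gateway(records: Iterable[dict]) -> Iterator[DataEvent]:
    """Hypothetical gateway: turns raw records into a stream of data events."""
    for rec in records:
        if rec.get("deleted"):
            yield DataEvent("delete", rec["id"])
        else:
            yield DataEvent("add", rec["id"], {"text": rec["text"]})

class Index:
    """Toy index that applies each event as soon as it is read."""

    def __init__(self):
        self.docs = {}

    def consume(self, stream: Iterable[DataEvent]) -> None:
        for event in stream:
            if event.op == "add":
                self.docs[event.uid] = event.doc
            elif event.op == "delete":
                self.docs.pop(event.uid, None)

    def search(self, term: str) -> list:
        term = term.lower()
        return sorted(uid for uid, doc in self.docs.items()
                      if term in doc["text"].lower())
```

The index pulls events from the gateway one by one, so a document becomes searchable (or disappears) as soon as its event has been consumed.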

BQL – search, filter and facets
● Search – common boolean and phrase operators
● Filters – WHERE conditions
● Facets – support basic analytics tasks defined on facets
● Relevance
  ● Default – recency
  ● Ad hoc – may be defined in the query

BQL Query Example on Tweets
SELECT *
WHERE hashtags IN ("TopChef")
BROWSE BY hashtags, user_screen_name, urls

Query examples
SELECT *
WHERE QUERY IS "relaxing cup of coffee"

Query examples
SELECT *
BROWSE by entities, sentiment

Query examples
SELECT *
WHERE time IN LAST 2 hours
BROWSE BY entities, sentiment
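
The query examples share one shape: an optional WHERE clause whose conditions are joined with AND, followed by an optional BROWSE BY list. A small, hypothetical helper (`build_bql` is not part of SenseiDB) can compose such strings from their parts. It does no escaping of values and only covers the constructs shown on these slides.

```python
def build_bql(query=None, filters=None, last=None, browse_by=None):
    """Compose a BQL string from optional search, filter, time and facet parts."""
    conditions = []
    if query:
        # Full-text search condition.
        conditions.append('QUERY IS "{}"'.format(query))
    for field_name, values in (filters or {}).items():
        # Facet filter: field IN ("v1", "v2", ...).
        quoted = ", ".join('"{}"'.format(value) for value in values)
        conditions.append("{} IN ({})".format(field_name, quoted))
    if last:
        # Time window, e.g. last="2 hours".
        conditions.append("time IN LAST {}".format(last))

    bql = "SELECT *"
    if conditions:
        bql += " WHERE " + " AND ".join(conditions)
    if browse_by:
        bql += " BROWSE BY " + ", ".join(browse_by)
    return bql
```

For instance, `build_bql(query="relaxing cup of coffee", last="2 hours", browse_by=["entities", "sentiment"])` combines a text search, a time window and two facets in one statement.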

Using facets for semantic search
● Define a facet for:
  ● entities/concepts → tweets about Chicote – include all variants + user + hashtags
  ● each entity type → navigate by type – popular people
  ● classification/sentiment/emotions → positive tweets about Chicote
  ● users or hashtags → popular users / popular mentions / correlated hashtags
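
Once each tweet carries its semantic annotations, the questions on this slide reduce to filters and counts over those fields. The sketch below works on a few hypothetical, already-annotated tweets held in memory; in the real system these would be BQL queries over SenseiDB facets, and the field layout here is an assumption.

```python
from collections import Counter

# Hypothetical annotated tweets: entities are (canonical name, type) pairs.
TWEETS = [
    {"entities": [("Alberto Chicote", "Person")],
     "sentiment": "positive", "hashtags": ["TopChef", "cocina"]},
    {"entities": [("Alberto Chicote", "Person"), ("Top Chef", "Program")],
     "sentiment": "negative", "hashtags": ["TopChef"]},
    {"entities": [("Another Chef", "Person")],
     "sentiment": "positive", "hashtags": ["TopChef", "cocina"]},
    {"entities": [("Top Chef", "Program")],
     "sentiment": "neutral", "hashtags": ["tv"]},
]

def about(tweets, entity, sentiment=None):
    """Tweets mentioning `entity`, optionally restricted to one sentiment."""
    return [t for t in tweets
            if any(name == entity for name, _ in t["entities"])
            and (sentiment is None or t["sentiment"] == sentiment)]

def popular(tweets, entity_type, n=5):
    """Most mentioned entities of a given type."""
    counts = Counter(name for t in tweets
                     for name, kind in t["entities"] if kind == entity_type)
    return counts.most_common(n)

def correlated_hashtags(tweets, hashtag, n=5):
    """Hashtags that most often co-occur with `hashtag`."""
    counts = Counter(tag for t in tweets if hashtag in t["hashtags"]
                     for tag in t["hashtags"] if tag != hashtag)
    return counts.most_common(n)
```

Here `about(TWEETS, "Alberto Chicote", "positive")` plays the role of "positive tweets about Chicote", and `popular(TWEETS, "Person")` that of "popular people".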

Scalability
● ZooKeeper to coordinate replicas
● Low indexing latency (no batch commit)
● Low search latency – even with indexing bursts
● Horizontally scalable – shards
● Shards may be replicated N times
● Elastic – nodes can be added to accommodate growth
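
Horizontal scaling through shards follows a scatter-gather pattern: route each document to one shard, query every shard, then merge the partial results. The `Cluster` class below is a toy, single-process sketch of that pattern with hypothetical names. It ranks by recency and leaves out replication and ZooKeeper coordination entirely.

```python
import heapq

class Cluster:
    """Toy sharded index: documents are partitioned by uid modulo shard count."""

    def __init__(self, num_shards):
        self.shards = [[] for _ in range(num_shards)]

    def index(self, uid, timestamp, text):
        # Route the document to exactly one shard.
        self.shards[uid % len(self.shards)].append((timestamp, uid, text))

    def search(self, term, limit=10):
        """Scatter the query to every shard, then merge the newest hits."""
        partials = []
        for shard in self.shards:
            hits = [doc for doc in shard if term in doc[2]]
            # Each shard returns only its own top `limit` hits by timestamp.
            partials.append(heapq.nlargest(limit, hits))
        merged = (doc for partial in partials for doc in partial)
        return heapq.nlargest(limit, merged)
```

Because every shard returns at most `limit` hits, the merge step stays small no matter how many documents each shard holds.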

Other features
● Batch indexing via Hadoop – ETL
● Simple analytics by batch indexing
● Customized relevance models
● MapReduce functions over facets
  ● Sum, avg, min, max
  ● DistinctCount
● Activity values – volatile values – likes
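
The aggregate functions listed above amount to grouping rows by a facet value and reducing a numeric column per group. The following sketch shows that idea on plain Python dictionaries; `aggregate` and `distinct_count` are hypothetical helpers, not SenseiDB functions.

```python
from collections import defaultdict

def aggregate(rows, facet, column):
    """Sum, avg, min and max of `column`, grouped by the value of `facet`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[facet]].append(row[column])   # "map": emit value per facet
    return {
        value: {                                 # "reduce": combine per facet
            "sum": sum(vals),
            "avg": sum(vals) / len(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for value, vals in groups.items()
    }

def distinct_count(rows, facet, column):
    """Number of distinct `column` values seen for each value of `facet`."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[facet]].add(row[column])
    return {value: len(items) for value, items in seen.items()}
```

A typical use would be the average number of retweets per sentiment, or the number of distinct users per hashtag.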

Conclusions
● SenseiDB is fast at searching and indexing – no variance
● A couple of nodes are enough to handle the Spanish Social TV volume
● We love the query language (BQL) and its time operators
● Supports real-time exploration

Limitations
● SenseiDB
  ● Single table model – flat users and reputation
  ● Tricks to store complex facets
  ● Documentation is still scarce
  ● Manageability
● Social TV Tracker
  ● Group and disambiguate entity mentions across tweets
  ● Relevance is tricky – ad hoc

Comparison
● Solr
  ● Near-real-time updates
    – Soft commits
  ● Simple facets
  ● Popular – great tools
● ElasticSearch
  ● Batch/realtime commits
  ● Online facets
  ● Aggregation after facets
  ● Much better plugin system
● Storm, S4?

Thanks and QA
@zdepablo
#bigdata #socialtv #2ndscreen #nlp @textalytics
