HUG (Hadoop User Group) meetup of January 29, 2015, at HP.
Slide decks of the three talks below:
#1: Processing unstructured data (videos, images, …) with Haven for Hadoop,
#2: Apache Flink: Fast and Reliable Large-scale Data Processing,
#3: Case study: a Hadoop project in the HR domain with Capgemini.
Document vectorization: making unstructured information comparable, new opportunities for an employment-sector player
38. What is Flink?
Collection programming APIs for batch and real-time streaming analysis,
backed by a very robust execution backend:
• true streaming capabilities,
• a custom memory manager,
• native iteration execution,
• and a cost-based optimizer.
39. The case for Flink
Performance and ease of use
• Exploits in-memory execution and pipelining; language-embedded logical APIs
Unified batch and real-time streaming
• Batch and Stream APIs on top of a streaming engine
A runtime that "just works" without tuning
• C++-style memory management inside the JVM
Predictable and dependable execution
• Bird's-eye view of what runs and how, and what failed and why
40. Example: WordCount
case class Word(word: String, frequency: Int)

val env = ExecutionEnvironment.getExecutionEnvironment
env.readTextFile(...)
   .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
   .groupBy("word").sum("frequency").print()
env.execute()

Flink has mirrored Java and Scala APIs that offer the same functionality, including by-name addressing.
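Since the DataSet operators mirror Scala collection methods, the same WordCount logic can be sketched on a plain local collection with no cluster or Flink dependency. This is an illustrative analogue, not Flink code; the `wordCount` helper name and the sample input are ours.

```scala
// Plain-Scala analogue of the slide's WordCount: flatMap / groupBy /
// per-group aggregation, exactly as in the DataSet API, but on a local Seq.
object CollectionWordCount {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))                 // ~ DataSet.flatMap
      .groupBy(identity)                     // ~ DataSet.groupBy("word")
      .map { case (w, ws) => (w, ws.size) }  // ~ sum("frequency")

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("to be or", "not to be")))
    // → Map(to -> 2, be -> 2, or -> 1, not -> 1)
}
```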
41. Example: Window WordCount
case class Word(word: String, frequency: Int)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val lines = env.fromSocketStream(...)
lines
  .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .window(Count.of(100)).every(Count.of(10))
  .groupBy("word").sum("frequency").print()
env.execute()
42. Defining windows
Trigger policy
• When to trigger the computation on the current window
Eviction policy
• When data points should leave the window
• Defines the window width/size
E.g., count-based policy:
• evict when #elements > n
• start a new window every n-th element
Built-in: Count, Time, and Delta policies
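The trigger/eviction interplay can be sketched in plain Scala (this is a simulation of the count-based policies, not the Flink API; `slidingCounts` and its parameters are illustrative names):

```scala
// Count-based window sketch: eviction caps the window at `size` elements,
// the trigger fires the word-count computation every `slide`-th element.
object CountWindowSketch {
  def slidingCounts(events: Seq[String], size: Int, slide: Int): Seq[Map[String, Int]] = {
    val results = scala.collection.mutable.Buffer[Map[String, Int]]()
    var window = Vector.empty[String]
    var sinceTrigger = 0
    for (e <- events) {
      window = (window :+ e).takeRight(size)  // eviction: drop oldest when #elements > size
      sinceTrigger += 1
      if (sinceTrigger == slide) {            // trigger: compute on the current window
        results += window.groupBy(identity).map { case (w, ws) => (w, ws.size) }
        sinceTrigger = 0
      }
    }
    results.toSeq
  }

  def main(args: Array[String]): Unit =
    println(slidingCounts(Seq("a", "b", "a", "c", "a", "b"), size = 4, slide = 2))
}
```

With `size = 4` and `slide = 2`, a result is emitted every second element over (at most) the last four elements, mirroring `.window(Count.of(100)).every(Count.of(10))` on the previous slide.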
43. Flink API in a nutshell
map, flatMap, filter, groupBy, reduce, reduceGroup, aggregate, join, coGroup, cross, project, distinct, union, iterate, iterateDelta, ...
All Hadoop input formats are supported
The API is similar for data sets and data streams, with slightly different operator semantics
Window functions for data streams
Counters, accumulators, and broadcast variables
44. Flink stack
[Diagram: the Flink stack. The Scala API, Java API, Python API (upcoming), Graph API (Gelly), and Apache MRQL sit on a Common API, served by the Flink Optimizer and the Flink Stream Builder. The Flink local runtime executes in an embedded environment (Java collections), a local environment (for debugging), or a remote environment (regular cluster execution), as single-node execution or on a standalone or YARN cluster, optionally via Apache Tez. Data storage: HDFS, files, S3, JDBC, Flume, RabbitMQ, Kafka, HBase, …]
45. Technology inside Flink
Technology inspired by compilers + MPP databases + distributed systems
For ease of use, reliable performance, and scalability

case class Path(from: Long, to: Long)

val tc = edges.iterate(10) {
  paths: DataSet[Path] =>
    val next = paths
      .join(edges)
      .where("to")
      .equalTo("from") {
        (path, edge) => Path(path.from, edge.to)
      }
      .union(paths)
      .distinct()
    next
}

[Diagram: pre-flight (client): cost-based optimizer, type extraction stack. Master: task scheduling, recovery metadata. Workers: memory manager, out-of-core algorithms, data serialization stack, streaming network stack, real-time streaming, ...]
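The transitive-closure iteration above can be simulated in plain Scala as a fixed-point loop over an in-memory edge set (an illustrative analogue, not Flink's `iterate` API; the `closure` helper is ours):

```scala
// Plain-Scala version of the slide's iteration: repeatedly join known
// paths with edges, union with the existing paths, deduplicate (Set),
// and stop at a fixed point or after `maxIterations` rounds.
object TransitiveClosure {
  case class Path(from: Long, to: Long)

  def closure(edges: Set[Path], maxIterations: Int = 10): Set[Path] = {
    var paths = edges
    var changed = true
    var i = 0
    while (changed && i < maxIterations) {
      val next = for {
        p <- paths
        e <- edges
        if p.to == e.from        // ~ .where("to").equalTo("from")
      } yield Path(p.from, e.to)
      val updated = paths ++ next  // ~ .union(paths).distinct()
      changed = updated != paths
      paths = updated
      i += 1
    }
    paths
  }

  def main(args: Array[String]): Unit =
    println(closure(Set(Path(1L, 2L), Path(2L, 3L), Path(3L, 4L))))
}
```

On the chain 1→2→3→4 the fixed point adds the paths 1→3, 2→4, and 1→4.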
46. Notable runtime features
1. Pipelined data transfers
2. Management of memory
3. Native iterations
4. Program optimization
48. Staged (batch) execution
[Diagram: a grep example over the text "Romeo, Romeo, where art thou Romeo?". Stage 1: load the log and create/cache it. Subsequent stages: grep the log for matches (search for str1, str2, str3 in Grep 1–3). Caching is in-memory, spilling to disk if needed.]
50. Pipelining in Flink
Currently the default mode of operation
• Much better performance in many cases – no need to materialize large data sets
• Supports both batch and real-time streaming
In the future, pluggable:
• Batch will use a combination of blocking and pipelining
• Streaming will use pipelining
• Interactive will use blocking
52. Memory management in Flink

public class WC {
  public String word;
  public int count;
}

[Diagram: a pool of empty memory pages. Managed memory (sorting, hashing, caching; shuffling, broadcasts) draws pages from the pool; user-code objects live in unmanaged memory.]
Flink contains its own memory management stack. Memory is
allocated, de-allocated, and used strictly using an internal buffer pool
implementation. To do that, Flink contains its own type extraction and
serialization components.
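The idea of serializing records into fixed-size pages rather than keeping them as JVM objects can be illustrated with a small sketch (this is not Flink's actual memory-segment code; `PagedRecords`, the page size, and the record layout are our assumptions for the demo):

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Records like WC(word, count) are written into a fixed-size byte page:
// [word length: Int][word bytes][count: Int]. The runtime, not the GC,
// controls layout, and a full page signals when to spill or allocate.
object PagedRecords {
  val PageSize = 64  // tiny page for the demo; Flink uses much larger segments

  def writeRecord(page: ByteBuffer, word: String, count: Int): Boolean = {
    val bytes = word.getBytes(StandardCharsets.UTF_8)
    if (page.remaining() < 4 + bytes.length + 4) false  // page full
    else {
      page.putInt(bytes.length).put(bytes).putInt(count)
      true
    }
  }

  def readRecord(page: ByteBuffer): (String, Int) = {
    val len = page.getInt
    val bytes = new Array[Byte](len)
    page.get(bytes)
    (new String(bytes, StandardCharsets.UTF_8), page.getInt)
  }

  def main(args: Array[String]): Unit = {
    val page = ByteBuffer.allocate(PageSize)
    writeRecord(page, "flink", 3)
    writeRecord(page, "hadoop", 1)
    page.flip()  // switch from writing to reading
    println(readRecord(page))
    println(readRecord(page))
  }
}
```

Operating on serialized bytes is what lets sorting and hashing work directly on pages and go out-of-core gracefully.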
53. Configuring Flink
Per job
• Parallelism
System config
• Total JVM heap size (-Xmx)
• % of the total JVM size for the Flink runtime
• Memory for network buffers (soon not needed)
That's all you need. The system will not throw an OOM exception at you.
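In configuration-file terms, the three knobs above might look like the following flink-conf.yaml fragment (key names as remembered from Flink's 0.8-era documentation; verify against your version's docs, and the values here are arbitrary):

```yaml
taskmanager.heap.mb: 4096                  # total JVM heap per TaskManager (-Xmx)
taskmanager.memory.fraction: 0.7           # share of the heap given to Flink's managed memory
taskmanager.network.numberOfBuffers: 2048  # memory for network buffers
parallelism.default: 8                     # default parallelism, overridable per job
```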
54. Benefits of managed memory
More reliable and stable performance (fewer GC effects, easy to spill to disk)
58. Iterate natively with deltas
[Diagram: delta iteration. An initial solution and an initial workset feed the iteration; each step joins the workset with other data sets (X, Y, A, B), produces a delta set, merges the deltas into the partial solution, and the deltas replace the workset for the next step, until the iteration result is reached.]
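The delta-iteration pattern can be sketched in plain Scala for min-label propagation (connected components) on an undirected edge list. This simulates the solution-set/workset mechanics, not Flink's `iterateDelta` API; `components` and its internals are illustrative names:

```scala
// Delta iteration: keep a solution (vertex -> component label) and a
// workset of vertices whose label changed; each step sends changed labels
// to neighbors, keeps only strictly smaller labels as deltas, merges them
// into the solution, and the deltas become the next workset.
object DeltaIterationSketch {
  def components(vertices: Set[Long], edges: Set[(Long, Long)]): Map[Long, Long] = {
    val neighbors: Map[Long, Set[Long]] =
      (edges ++ edges.map(_.swap))
        .groupBy(_._1).map { case (v, es) => (v, es.map(_._2)) }

    var solution: Map[Long, Long] = vertices.map(v => v -> v).toMap  // initial solution
    var workset: Map[Long, Long] = solution                          // initial workset
    while (workset.nonEmpty) {
      val candidates = for {
        (v, label) <- workset.toSeq
        n <- neighbors.getOrElse(v, Set.empty[Long])
      } yield (n, label)
      val deltas = candidates
        .groupBy(_._1).map { case (v, ls) => (v, ls.map(_._2).min) }
        .filter { case (v, l) => l < solution(v) }  // only real improvements
      solution = solution ++ deltas  // merge deltas into the partial solution
      workset = deltas               // deltas replace the workset
    }
    solution
  }
}
```

Because only changed vertices re-enter the workset, work shrinks every round, which is exactly why native delta iterations beat re-running a bulk iteration over the full solution.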
62. Flink roadmap for 2015
Unify batch and streaming
Machine learning library and Mahout
Graph processing library improvements
Interactive programs and Zeppelin
Logical queries and SQL
And many more
63. Thank you for your invitation
Check out the project website: http://flink.apache.org
news@flink.apache.org and other mailing lists
Twitter: @ApacheFlink
Feedback & contributions welcome