This talk looks at how search engines are evolving to better understand linguistic nuance in natural language processing, and to better understand searcher intent.
18. But I will be talking about some concepts covering:
• Data Science
• Information Retrieval
• Algorithms
• Linguistics
• Information Architecture
• Library Science
• Category Theory
19. Since… these are all areas connected to how search engines (try to) find the right information, for the right informational need, at the right time, for the right user.
20. ‘Information retrieval’ in web search: to retrieve informational resources that meet a search engine user’s information need at the time of query.
21. Let us first take a very simplistic look at how we know search engines work.
22. It’s just like gathering & organizing books in a library system, or using an old card-index system.
23. But instead we are taking words (or phrases) and recording where they live.
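To make that concrete, here is a minimal sketch of the idea in Python: in effect a toy inverted index, mapping each word to the documents (and positions) where it appears. The documents and words are invented for illustration.

from collections import defaultdict

docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
}

# word -> [(doc_id, position), ...]: where each word "lives"
index = defaultdict(list)
for doc_id, text in docs.items():
    for position, word in enumerate(text.split()):
        index[word].append((doc_id, position))

print(index["cat"])  # [(1, 1), (2, 4)]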
34. Relevance matching to a query requires:
• Understanding the meaning of words in content & query (What?)
• Understanding the meaning of a word’s context in content & query (What?)
• Understanding the user’s context (Who / Where / When / Why?)
• Understanding collaboration (past queries / popularity / reinforcement / learning to rank)
43. Many websites (and webpages) are not logically organized at all. Unstructured data is:
• Voluminous
• Filled with irrelevance
• Lacking focus
• Riddled with nuance
• Full of meaningless text and further ambiguity-inducing jabber
44. Most text-filled web pages could be considered unstructured, noisy data. Blog == Blah Blah.
45. Structured versus unstructured data
• Structured data has a high degree of organization
• Readily searchable by simple search engine algorithms or known search operators (e.g. SQL)
• Logically organized
• Often stored in a relational database
46. When we compare them with highly organized relational database systems (a minimal sketch follows).
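As a hedged illustration of how readily searchable structured data is, here is a minimal sketch using Python’s built-in sqlite3 module. The table, columns and rows are invented for the example, not taken from the talk.

import sqlite3

# Structured data: logically organized rows with a fixed schema,
# queryable with a known search operator language (SQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [("Moby-Dick", "Melville", 1851), ("Dracula", "Stoker", 1897)],
)

# A simple, exact query over well-defined fields:
for (title,) in conn.execute("SELECT title FROM books WHERE year > 1890"):
    print(title)  # Dracula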
47. A form of structured (& semi-structured) data: entities, knowledge graphs, knowledge bases & knowledge repositories.
48. “Entities help to bridge the gap between structured and unstructured data” (Krisztian Balog, ECIR 2019 keynote).
56. Since a website is NOT ALL unstructured data, even before structured data markup:
• It can have a hierarchy
• It can have weighted sections
• It can have metadata
• It (often) has a tree-like structure
57. As long as there is understanding of notions of categorical ‘inheritance’.
60. Semi-structured data
• Hierarchical nature of a website
• Tree structure
• Well sectioned, with clear containers and meta headings
• An ontology map between semi-structured and structured data
(see the sketch below)
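To show how an ordinary HTML page already carries a tree-like, semi-structured shape, here is a minimal sketch using only Python’s standard library. The page snippet and the headings-only focus are assumptions of the example, which also assumes plain text directly inside each heading tag.

from html.parser import HTMLParser

class HeadingParser(HTMLParser):
    """Collect (level, text) pairs for h1-h3 headings."""
    def __init__(self):
        super().__init__()
        self.current = None
        self.outline = []  # [(level, text), ...]

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.current = int(tag[1])

    def handle_data(self, data):
        if self.current is not None:
            self.outline.append((self.current, data.strip()))
            self.current = None

page = "<h1>Dogs</h1><h2>Breeds</h2><h3>Pomeranian</h3><h2>Care</h2>"
parser = HeadingParser()
parser.feed(page)
for level, text in parser.outline:
    print("  " * (level - 1) + text)  # indentation mirrors the page hierarchy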
77. When they can’t even tell the difference between Pomeranians and pancakes
78. They need ‘text cohesion’. Cohesion is the grammatical and lexical linking within a text or sentence that holds the text together and gives it meaning. Without surrounding words, the word ‘bucket’ could mean anything in a sentence (a disambiguation sketch follows).
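As a hedged sketch of how surrounding words can pin a sense down, here is the classic Lesk algorithm via NLTK’s implementation (an assumption of this sketch, not a method named in the talk; requires pip install nltk and nltk.download('wordnet')). The example sentence is ours, and the chosen sense depends on gloss overlap, so treat it as illustrative only.

from nltk.wsd import lesk

# Lesk picks the WordNet sense whose dictionary gloss overlaps
# most with the surrounding context words.
context = "she filled the bucket with water from the well".split()
sense = lesk(context, "bucket")
if sense:
    print(sense.name(), "-", sense.definition())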
98. Coast and shore example
• Coast and shore have a similar meaning
• They co-occur in first- and second-level relatedness documents in a collection
• They would receive a high similarity score (see the sketch below)
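A minimal sketch of that distributional idea: give each word a vector of co-occurrence counts with shared context words, and words that appear in similar contexts score high on cosine similarity. All counts here are invented.

import math

# Co-occurrence counts with the context words ["sea", "sand", "syrup"]
vectors = {
    "coast":   [8, 5, 0],
    "shore":   [7, 6, 0],
    "pancake": [0, 1, 9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine(vectors["coast"], vectors["shore"]))    # high (~0.99)
print(cosine(vectors["coast"], vectors["pancake"]))  # low (~0.06)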
99. Language models are trained on very large text corpora or collections (loads of words) to learn distributional similarity.
103. Continuous Bag of Words (CBoW) or skip-gram (the opposite of CBoW)
Continuous Bag of Words takes the surrounding words within a context window of size n and uses them to predict the target word; skip-gram reverses this, predicting the context words from the target. Training over a large corpus yields vector models and word embeddings, in which distances between vectors indicate which words are similar or related (a training sketch follows).
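A minimal training sketch using the gensim library (which appears in the references). The toy corpus is invented, and with so little text the resulting similarity score is essentially noise; the point is only to show the mechanics.

from gensim.models import Word2Vec

corpus = [
    "the waves hit the rocky coast".split(),
    "we walked along the sandy shore".split(),
    "the coast and the shore were quiet".split(),
]

# sg=1 selects skip-gram; sg=0 would select CBoW.
model = Word2Vec(corpus, vector_size=50, window=3, sg=1, min_count=1)
print(model.wv.similarity("coast", "shore"))  # score in [-1, 1]; noisy on a toy corpus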
113. Most language models are uni-directional
Source text: “Writing a list of random sentences is harder than I initially thought it would be.”
(On the slide, the sentence is repeated to show the context window sliding across it.)
They can traverse the word’s context window from only left to right, or right to left: in one direction, but not both at the same time.
114. They can only look at the words in the context window before the target, not the words in the rest of the sentence, nor the sentence that follows (a sketch of this limitation appears below).
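A small sketch of that limitation: a strictly left-to-right model only ever conditions on the words before the target, never the words after it. The sentence and window size are illustrative.

tokens = "writing a list of random sentences is harder than i thought".split()

window = 3
for i, target in enumerate(tokens):
    left_context = tokens[max(0, i - window):i]
    print(f"{left_context} -> {target}")  # the model never sees tokens[i+1:]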
120. BERT is different. BERT uses bi-directional language modelling: the FIRST to do this.
Source text: “Writing a list of random sentences is harder than I initially thought it would be.”
(Again, the slide repeats the sentence to illustrate the moving target word.)
BERT can see both the left- and the right-hand side of the target word.
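A hedged sketch of bi-directional context in action, using the Hugging Face transformers library (an assumption of the example, not something named in the talk): BERT fills in a masked word using the words on both sides of it.

from transformers import pipeline

# Requires: pip install transformers torch (downloads a pretrained model).
fill = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill("Writing a list of random [MASK] is harder than I thought."):
    # Each candidate uses context to the left AND right of [MASK].
    print(candidate["token_str"], round(candidate["score"], 3))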
144. There are also several types of queries (Krisztian Balog, ECIR, 2019):
• Keyword queries (normal keyword queries)
• Keyword++ queries (faceted / filtered queries)
• Zero-query queries (the user is the query)
• Natural language queries
• Structured queries (e.g. SQL)
168. What did you really mean when you searched for ‘Easter’?
When did you search for ‘Easter’? → What you mostly meant:
• A few weeks before Easter → When is Easter?
• A few days before Easter → Things to do at Easter
• During Easter → What is the meaning of Easter?
Radinsky, K., Svore, K.M., Dumais, S.T., Shokouhi, M., Teevan, J., Bocharov, A. and Horvitz, E., 2013. Behavioral dynamics on the web: Learning, modeling, and prediction. ACM Transactions on Information Systems (TOIS), 31(3), p.16.
193. Different features matter more to users depending on the domain:
• News (freshness)
• Jobs (salary, job title, location)
• Restaurants (location, cuisine)
• Shopping (price)
194. In theory… a consolidated page should rank higher… but…
217. • Balog, K. Entity-Oriented Search. SpringerLink. [ONLINE] Available at: https://link.springer.com/book/10.1007/978-3-319-93935-3 [Accessed 06 May 2019].
• Boyd-Graber, J., Hu, Y. and Mimno, D., 2017. Applications of topic models. Foundations and Trends® in Information Retrieval, 11(2-3), pp.143-296.
• ECIR 2019. Proceedings. [ONLINE] Available at: http://ecir2019.org/proceedings/ [Accessed 06 May 2019].
• Gabrilovich, E. and Markovitch, S., 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI (Vol. 7, pp. 1606-1611).
• Hakkani-Tur, D., Tur, G., Li, X. and Li, Q., Microsoft Technology Licensing LLC, 2017. Personal knowledge graph population from declarative user utterances. U.S. Patent Application 14/809,243.
• Lim, Y.J., Linn, J., Liang, Y., Steinebach, C., Lu, W.L., Kim, D.H., Kunz, J., Koepnick, L. and Yang, M., Google LLC, 2018. Predicting intent of a search for a particular context. U.S. Patent Application 15/598,580.
• Lotfi, A., Bouchachia, H., Gegov, A., Langensiepen, C. and McGinnity, M., 2018. Advances in Computational Intelligence Systems.
218. • Lohar, P., Ganguly, D., Afli, H., Way, A. and Jones, G.J., 2016. FaDA: Fast document aligner using word embedding. The Prague Bulletin of Mathematical Linguistics, 106(1), pp.169-179.
• McDonald, R., Brokos, G.I. and Androutsopoulos, I., 2018. Deep relevance ranking using enhanced document-query interactions. arXiv preprint arXiv:1809.01682.
• NTENT, 2019. Query Understanding. [ONLINE] Available at: https://ntent.com/technology/query-understanding/ [Accessed 09 May 2019].
• Plank, Barbara. Keynote: Natural Language Processing. https://www.youtube.com/watch?v=Wl6c0OpF6Ho
• Radinsky, Kira. TEDx talk. https://www.youtube.com/watch?v=gAifa_CVGCY
• Radinsky, K., 2012. Learning to predict the future using Web knowledge and dynamics. In ACM SIGIR Forum (Vol. 46, No. 2, pp. 114-115). ACM.
219. • https://www.youtube.com/watch?v=Ozpek_FrOPs
• Sherkat, E. and Milios, E.E., 2017. Vector embedding of Wikipedia concepts and entities. In International Conference on Applications of Natural Language to Information Systems (pp. 418-428). Springer, Cham.
• Syed, U., Slivkins, A. and Mishra, N., 2009. Adapting to the shifting intent of search queries. In Advances in Neural Information Processing Systems (pp. 1829-1837).
223. • http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
• Semantic similarity and relatedness as scaffolding for natural language processing. https://www.youtube.com/watch?v=YTBVfQ8iBSo
• gensim: models.word2vec – Word2vec embeddings, 2019. [ONLINE] Available at: https://radimrehurek.com/gensim/models/word2vec.html [Accessed 09 May 2019].
Editor’s notes
Anyway… back to words. Words are everywhere.
Is there any point in returning results from nearby if the speed at which the user is travelling will render the result useless in just a minute or two?
Better to utilize tools such as the accelerometer and understand the direction and speed the user is travelling in, and return the most appropriate results as suggestions.
Their ambiguous and polysemic nature means that search engines have to try to disambiguate their meaning, in order to understand what the searcher meant and also to understand what the content means.