SlideShare una empresa de Scribd logo
1 de 1
Descargar para leer sin conexión
Interpretable Topic Modeling Using Near-Identity
Cross-Document Coreference Resolution
Anastasia Zhukova1, Felix Hamborg2, Bela Gipp1
Motivation
• Topic modelling is a widely-known technique for
document representation, summarization, and
classification.
• Approaches such as LDA, NMF, pLSA are state-of-the-
art techniques to reduce text dimensionality and
summarize documents.
• Often, topic modelling approaches produce
statistically valid yet hard to interpret topics [1,2,6,7,8].
• On the contrary, cross-document coreference
resolution (CDCR) aims at identification of clearly
defined events and entities [3, 4].
• SoTA CDCR methods identify narrow events and
entities well but suffer from lack of generalization
and summarization across identified concepts [4].
Contact
Web: https://dke.uni-wuppertal.de/en/
Email: zhukova@uni-wuppertal.de, felix.hamborg@uni.kn
1 University of Wuppertal, 2 University of Konstanz
Research Question
How to extract more interpretable and
representative topics with CDCR-based methods
from text documents?
NI-CDCR Topic Modelling
Target Concept Analysis (TCA) [4] is a CDCR approach that
resolves concept mentions with identity and near-
identity relations and uses concepts as topics.
260
References
[1] Loulwah Alsumait and Daniel Barbar. 2009. Topic Significance Ranking of LDA Generative Models. In Machine
Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2009, Bled, Slovenia,
September 7-11, 2009, Proceedings, Part I, 67–82. [2] Jonathan Chang et al. 2009. Reading Tea Leaves: How
Humans Interpret Topic Models. Journal of Chemical Information and Modeling 53, (2009), 1689–1699. [3] Felix
Hamborg et al. 2019. Automated Identification of Media Bias by Word Choice and Labeling in News Articles.
Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL) (2019), 196–205. [4] Felix Hamborg et al.
2019. Illegal Aliens or Undocumented Immigrants ? Towards the Automated Identification of Bias by Word Choice
and Labeling. in Proceedings of the iConference 2019 (2019). [5] Han Kyul Kim et al. 2017. Bag-of-concepts:
Comprehending document representation through clustering words in distributed representation.
Neurocomputing 266, (2017), 336–352. [6] Jey Han Lau et al. 2014. Machine Reading Tea Leaves: Automatically
Evaluating Topic Coherence and Topic Model Quality. Proceedings of the 14th Conference of the European Chapter
of the Association for Computational Linguistics (2014), 530–539. [7] David Newman et al. 2010. Evaluating topic
models for digital libraries. January (2010), 215. [8] Hanna M Wallach and David Mimno. 2009. Evaluation Methods
for Topic Models. In Proceedings of the 26th annual international conference on machine learning, 1105–1112.
Future Work
• Evaluate the proposed approach using likelihood and
perplexity.
• Evaluate using more diverse real-world text datasets
• Improve the quality of CDCR-based topics.
Qualitative Evaluation
• Preliminary evaluation on NewsWCL50 [4]
o a news articles dataset for concept identification with large
lexical diversity.
• TCA extracts topics that distinctively describe
meaningful parts, e.g., entities: actors, actions,
locations.
Method Most representative terms of one topic
LSA Mr. Trump, say, Iran, nuclear, Comey, deal
LDA Comey, Trump, memo, write, president, Thursday
NMF visit, London, Trump, Khan, protest, July, state
BoC FBI, clandestine, Watergate, CIA, Comey, intel
TCA Russia investigation, Mr. Mueller's investigation, an
investigation into Russia's election meddling, a probe
into Russian interference
Quantitative Evaluation
• TCA’s coherence Cv = 0.52 < 0.68 achieved by NMF
• ↓ coherence, but ↑ #topics and ↑ interpretability
Method Unit Cv # topics Topic type
LSA L, B 0.47 2 Abstract
LDA L, B 0.62 10 Summary of events
NMF L, B 0.68 10 Summary of events
BoC L, B 0.52 534 Concepts
TCA L, P 0.52 97 Concepts
DACA recipients
undocumented
immigrants
illegally
brought minors
US military action
military
presence
military
threat
Document representation:
topic
topic’s
words
Three types of properties analysed in merging sieves:
Sieves 1-2 Core meaning of phrases
L – lemmas B – bigrams (freq. > 5) P – phrases (NPs, VPs)
Sieves 3-4
similar properties of two phrases?
→ merge as referring to one concept
TCA’s resolution principle:
Normalized frequencies of topic’s words in each doc
→ weighted distribution of topics
Donald Trump ↔ President
Core meaning modifiers
undocumented + immigrants ↔ DACA + recipients
Sieves 5-6 Frequent word patterns
White House officials ↔ White House representatives

Más contenido relacionado

Similar a Interpretable Topic Modeling Using Near-Identity Cross-Document Coreference Resolution

Similar a Interpretable Topic Modeling Using Near-Identity Cross-Document Coreference Resolution (20)

Creating metadata for data visualization
Creating metadata for data visualizationCreating metadata for data visualization
Creating metadata for data visualization
 
A Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And ApplicationsA Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And Applications
 
Introduction to Nvivo
Introduction to NvivoIntroduction to Nvivo
Introduction to Nvivo
 
Coding.ppt
Coding.pptCoding.ppt
Coding.ppt
 
Overview of the dissertation process. Tools, techniques by phase.
Overview of the dissertation process. Tools, techniques by phase.Overview of the dissertation process. Tools, techniques by phase.
Overview of the dissertation process. Tools, techniques by phase.
 
Workshop unpad2014 with ref
Workshop unpad2014 with refWorkshop unpad2014 with ref
Workshop unpad2014 with ref
 
LCC CTS 2 Option.docx
LCC CTS 2 Option.docxLCC CTS 2 Option.docx
LCC CTS 2 Option.docx
 
Guestion paper
Guestion paperGuestion paper
Guestion paper
 
Argument Structures Of Political Debates
Argument Structures Of Political DebatesArgument Structures Of Political Debates
Argument Structures Of Political Debates
 
Introduction To Critical Enquiry Research
Introduction To Critical Enquiry ResearchIntroduction To Critical Enquiry Research
Introduction To Critical Enquiry Research
 
IRJET-A Review on Topic Detection and Term-Term Relation Analysis in Big Data
IRJET-A Review on Topic Detection and Term-Term Relation Analysis in Big DataIRJET-A Review on Topic Detection and Term-Term Relation Analysis in Big Data
IRJET-A Review on Topic Detection and Term-Term Relation Analysis in Big Data
 
Probabilistic Topic Models
Probabilistic Topic ModelsProbabilistic Topic Models
Probabilistic Topic Models
 
Cultural text mining workshop
Cultural text mining workshopCultural text mining workshop
Cultural text mining workshop
 
Dr.saleem gul assignment summary
Dr.saleem gul assignment summaryDr.saleem gul assignment summary
Dr.saleem gul assignment summary
 
Content Analysis Overview for Persona Development
Content Analysis Overview for Persona DevelopmentContent Analysis Overview for Persona Development
Content Analysis Overview for Persona Development
 
Content analysis
Content analysisContent analysis
Content analysis
 
Content analysis
Content analysisContent analysis
Content analysis
 
Modeling Causal Reasoning in Complex Networks through NLP: an Introduction
Modeling Causal Reasoning in Complex Networks through NLP: an IntroductionModeling Causal Reasoning in Complex Networks through NLP: an Introduction
Modeling Causal Reasoning in Complex Networks through NLP: an Introduction
 
Introduction and Tools of Research
Introduction and Tools of ResearchIntroduction and Tools of Research
Introduction and Tools of Research
 
Detecting subtle text manipulations
Detecting subtle text manipulationsDetecting subtle text manipulations
Detecting subtle text manipulations
 

Más de Anastasia Zhukova

M.Sc. Thesis: Automated Identification of Framing by Word Choice and Labeling...
M.Sc. Thesis: Automated Identification of Framing by Word Choice and Labeling...M.Sc. Thesis: Automated Identification of Framing by Word Choice and Labeling...
M.Sc. Thesis: Automated Identification of Framing by Word Choice and Labeling...
Anastasia Zhukova
 
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
Anastasia Zhukova
 
Interpretable and Comparative Textual Dataset Exploration Using Near-Identity...
Interpretable and Comparative Textual Dataset Exploration Using Near-Identity...Interpretable and Comparative Textual Dataset Exploration Using Near-Identity...
Interpretable and Comparative Textual Dataset Exploration Using Near-Identity...
Anastasia Zhukova
 
Towards Evaluation of Cross-document Coreference Resolution Models Using Data...
Towards Evaluation of Cross-document Coreference Resolution Models Using Data...Towards Evaluation of Cross-document Coreference Resolution Models Using Data...
Towards Evaluation of Cross-document Coreference Resolution Models Using Data...
Anastasia Zhukova
 
Concept Identification of Directly and Indirectly Related Mentions Referring ...
Concept Identification of Directly and Indirectly Related Mentions Referring ...Concept Identification of Directly and Indirectly Related Mentions Referring ...
Concept Identification of Directly and Indirectly Related Mentions Referring ...
Anastasia Zhukova
 

Más de Anastasia Zhukova (10)

What's in the News? Towards Identification of Bias by Commission, Omission, a...
What's in the News? Towards Identification of Bias by Commission, Omission, a...What's in the News? Towards Identification of Bias by Commission, Omission, a...
What's in the News? Towards Identification of Bias by Commission, Omission, a...
 
Seminar Paper: Putting News in a Perspective: Framing by Word Choice and Labe...
Seminar Paper: Putting News in a Perspective: Framing by Word Choice and Labe...Seminar Paper: Putting News in a Perspective: Framing by Word Choice and Labe...
Seminar Paper: Putting News in a Perspective: Framing by Word Choice and Labe...
 
M.Sc. Thesis: Automated Identification of Framing by Word Choice and Labeling...
M.Sc. Thesis: Automated Identification of Framing by Word Choice and Labeling...M.Sc. Thesis: Automated Identification of Framing by Word Choice and Labeling...
M.Sc. Thesis: Automated Identification of Framing by Word Choice and Labeling...
 
Talk: Automated Identification of Media Bias by Word Choice and Labeling in N...
Talk: Automated Identification of Media Bias by Word Choice and Labeling in N...Talk: Automated Identification of Media Bias by Word Choice and Labeling in N...
Talk: Automated Identification of Media Bias by Word Choice and Labeling in N...
 
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
 
Putting News in a Perspective: Framing by Word Choice and Labeling
Putting News in a Perspective: Framing by Word Choice and LabelingPutting News in a Perspective: Framing by Word Choice and Labeling
Putting News in a Perspective: Framing by Word Choice and Labeling
 
Interpretable and Comparative Textual Dataset Exploration Using Near-Identity...
Interpretable and Comparative Textual Dataset Exploration Using Near-Identity...Interpretable and Comparative Textual Dataset Exploration Using Near-Identity...
Interpretable and Comparative Textual Dataset Exploration Using Near-Identity...
 
Towards Evaluation of Cross-document Coreference Resolution Models Using Data...
Towards Evaluation of Cross-document Coreference Resolution Models Using Data...Towards Evaluation of Cross-document Coreference Resolution Models Using Data...
Towards Evaluation of Cross-document Coreference Resolution Models Using Data...
 
Concept Identification of Directly and Indirectly Related Mentions Referring ...
Concept Identification of Directly and Indirectly Related Mentions Referring ...Concept Identification of Directly and Indirectly Related Mentions Referring ...
Concept Identification of Directly and Indirectly Related Mentions Referring ...
 
ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts
ANEA: Automated (Named) Entity Annotation for German Domain-Specific TextsANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts
ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts
 

Último

THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
ANSARKHAN96
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
Silpa
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
Silpa
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
MohamedFarag457087
 
CYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptxCYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptx
Silpa
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
NazaninKarimi6
 

Último (20)

THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
 
Genome sequencing,shotgun sequencing.pptx
Genome sequencing,shotgun sequencing.pptxGenome sequencing,shotgun sequencing.pptx
Genome sequencing,shotgun sequencing.pptx
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
 
CYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptxCYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptx
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptx
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsTransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 

Interpretable Topic Modeling Using Near-Identity Cross-Document Coreference Resolution

  • 1. Interpretable Topic Modeling Using Near-Identity Cross-Document Coreference Resolution Anastasia Zhukova1, Felix Hamborg2, Bela Gipp1 Motivation • Topic modelling is a widely-known technique for document representation, summarization, and classification. • Approaches such as LDA, NMF, pLSA are state-of-the- art techniques to reduce text dimensionality and summarize documents. • Often, topic modelling approaches produce statistically valid yet hard to interpret topics [1,2,6,7,8]. • On the contrary, cross-document coreference resolution (CDCR) aims at identification of clearly defined events and entities [3, 4]. • SoTA CDCR methods identify narrow events and entities well but suffer from lack of generalization and summarization across identified concepts [4]. Contact Web: https://dke.uni-wuppertal.de/en/ Email: zhukova@uni-wuppertal.de, felix.hamborg@uni.kn 1 University of Wuppertal, 2 University of Konstanz Research Question How to extract more interpretable and representative topics with CDCR-based methods from text documents? NI-CDCR Topic Modelling Target Concept Analysis (TCA) [4] is a CDCR approach that resolves concept mentions with identity and near- identity relations and uses concepts as topics. 260 References [1] Loulwah Alsumait and Daniel Barbar. 2009. Topic Significance Ranking of LDA Generative Models. In Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2009, Bled, Slovenia, September 7-11, 2009, Proceedings, Part I, 67–82. [2] Jonathan Chang et al. 2009. Reading Tea Leaves: How Humans Interpret Topic Models. Journal of Chemical Information and Modeling 53, (2009), 1689–1699. [3] Felix Hamborg et al. 2019. Automated Identification of Media Bias by Word Choice and Labeling in News Articles. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL) (2019), 196–205. [4] Felix Hamborg et al. 2019. Illegal Aliens or Undocumented Immigrants ? Towards the Automated Identification of Bias by Word Choice and Labeling. in Proceedings of the iConference 2019 (2019). [5] Han Kyul Kim et al. 2017. Bag-of-concepts: Comprehending document representation through clustering words in distributed representation. Neurocomputing 266, (2017), 336–352. [6] Jey Han Lau et al. 2014. Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (2014), 530–539. [7] David Newman et al. 2010. Evaluating topic models for digital libraries. January (2010), 215. [8] Hanna M Wallach and David Mimno. 2009. Evaluation Methods for Topic Models. In Proceedings of the 26th annual international conference on machine learning, 1105–1112. Future Work • Evaluate the proposed approach using likelihood and perplexity. • Evaluate using more diverse real-world text datasets • Improve the quality of CDCR-based topics. Qualitative Evaluation • Preliminary evaluation on NewsWCL50 [4] o a news articles dataset for concept identification with large lexical diversity. • TCA extracts topics that distinctively describe meaningful parts, e.g., entities: actors, actions, locations. Method Most representative terms of one topic LSA Mr. Trump, say, Iran, nuclear, Comey, deal LDA Comey, Trump, memo, write, president, Thursday NMF visit, London, Trump, Khan, protest, July, state BoC FBI, clandestine, Watergate, CIA, Comey, intel TCA Russia investigation, Mr. Mueller's investigation, an investigation into Russia's election meddling, a probe into Russian interference Quantitative Evaluation • TCA’s coherence Cv = 0.52 < 0.68 achieved by NMF • ↓ coherence, but ↑ #topics and ↑ interpretability Method Unit Cv # topics Topic type LSA L, B 0.47 2 Abstract LDA L, B 0.62 10 Summary of events NMF L, B 0.68 10 Summary of events BoC L, B 0.52 534 Concepts TCA L, P 0.52 97 Concepts DACA recipients undocumented immigrants illegally brought minors US military action military presence military threat Document representation: topic topic’s words Three types of properties analysed in merging sieves: Sieves 1-2 Core meaning of phrases L – lemmas B – bigrams (freq. > 5) P – phrases (NPs, VPs) Sieves 3-4 similar properties of two phrases? → merge as referring to one concept TCA’s resolution principle: Normalized frequencies of topic’s words in each doc → weighted distribution of topics Donald Trump ↔ President Core meaning modifiers undocumented + immigrants ↔ DACA + recipients Sieves 5-6 Frequent word patterns White House officials ↔ White House representatives