Language Sleuthing HOWTO
                 or
   Discovering Interesting Things
           with Python's
     Natural Language Tool Kit


                           Brianna Laugher
                                modernthings.org
                         brianna[@.]laugher.id.au
Corpus linguistics on web texts

          why?
Because the web is full of
       language data

 Because linguistic techniques
can reveal unexpected insights

Because I don't want to have to
       read everything
Like... mailing lists
luv-main as a corpus



√ Big collection of text
x Messy data
x Not annotated
what's interesting?

   conversations

      topics

 change over time

     (authors)
Step 1: get the data
wget vs Python script


√ wget is purpose-built

√ convenient options like
   --convert-links
Meaningful URLs FTW


              Sympa/MHonArc:


lists.luv.asn.au/wws/arc/luv-main/
                                 2009-04/
                                         msg00057.html
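
Because the archive URLs encode the month and the message number, that metadata can be recovered from the path alone. A minimal sketch, assuming every message follows the `YYYY-MM/msgNNNNN.html` layout of the example URL; `parse_archive_url` is a made-up name for illustration.

```python
import re

# Assumed layout, taken from the example URL: .../arc/<list>/<YYYY-MM>/msg<NNNNN>.html
ARCHIVE_PATH = re.compile(
    r"/arc/(?P<list>[^/]+)/(?P<year>\d{4})-(?P<month>\d{2})/msg(?P<num>\d+)\.html$"
)

def parse_archive_url(url):
    """Return list name, year, month and message number, or None if no match."""
    match = ARCHIVE_PATH.search(url)
    if match is None:
        return None
    info = match.groupdict()
    info["num"] = int(info["num"])   # message number as an integer
    return info

print(parse_archive_url("lists.luv.asn.au/wws/arc/luv-main/2009-04/msg00057.html"))
```

The year and month strings can later be reused as corpus categories.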
Step 2: clean the data
Cleaning for what?

Remove archive boilerplate

      Remove HTML

   Remove quoted text?

   Remove signatures?
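
Whether quoted text and signatures should go depends on the analysis. If they should, a rough pass over plain-text message bodies might look like the sketch below. It assumes the common email conventions of `>`-prefixed quotes and a `-- ` signature delimiter, which real messages do not always follow; `strip_quotes_and_signature` is a hypothetical helper.

```python
def strip_quotes_and_signature(body):
    """Drop '>'-quoted lines and everything after a '-- ' signature delimiter."""
    kept = []
    for line in body.splitlines():
        if line == "-- ":                  # conventional signature delimiter
            break
        if line.lstrip().startswith(">"):  # quoted reply text
            continue
        kept.append(line)
    return "\n".join(kept).strip()

message = "\n".join([
    "Thanks, that worked.",
    "> Have you tried rebooting?",
    "> It usually helps.",
    "-- ",
    "Sent from my toaster",
])
print(strip_quotes_and_signature(message))
```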
[Figure: only the labels J.W., J.W. and W.E. survive]
Behind the scenes
[Figure: only the labels J.W. and W.E. survive]
what are we aiming for?




what do NLTK corpora look like?
Getting NLTK


sudo apt-get install python-nltk
         in Ubuntu 10.04
                 or
sudo apt-get install python-pip
         pip install nltk
                 or
  from source at nltk.org/download
Getting NLTK data...




    an “NLTKism”
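
NLTK ships its corpora separately from the code, so they have to be fetched once with the built-in downloader (calling `nltk.download()` with no arguments opens it interactively). A small sketch, assuming NLTK is installed and the machine has network access; `ensure_nltk_data` is a hypothetical wrapper, while `nltk.data.find` and `nltk.download` are NLTK's own functions.

```python
import nltk

def ensure_nltk_data(resource_path, package):
    """Download an NLTK data package only if it is not already installed."""
    try:
        nltk.data.find(resource_path)   # raises LookupError when missing
        return False                    # already present, nothing downloaded
    except LookupError:
        return nltk.download(package)   # True on success

# Example calls (commented out because they need network access):
# ensure_nltk_data("corpora/brown", "brown")
# ensure_nltk_data("corpora/stopwords", "stopwords")
```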
NLTK corpora types
Brown corpus
A CategorizedTagged corpus:

   Dear/jj Sirs/nns :/: Let/vb me/ppo begin/vb by/in
clearing/vbg up/in any/dti possible/jj
misconception/nn in/in your/pp$ minds/nns ,/,
wherever/wrb you/ppss are/ber ./.
The/at collective/nn by/in which/wdt I/ppss
address/vb you/ppo in/in the/at title/nn above/rb
is/bez neither/cc patronizing/vbg nor/cc jocose/jj
but/cc an/at exact/jj industrial/jj term/nn in/in
use/nn among/in professional/jj thieves/nns ./.
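
Each token in the tagged format is simply `word/tag`, so it is easy to split back into pairs. A plain-Python sketch for illustration only; in practice NLTK's tagged corpus readers do this for you, and `parse_tagged` is a made-up name.

```python
def parse_tagged(text):
    """Split 'word/tag' tokens into (word, tag) tuples."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")  # split on the last slash only
        pairs.append((word, tag))
    return pairs

sample = "Dear/jj Sirs/nns :/: Let/vb me/ppo begin/vb"
print(parse_tagged(sample))
```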
Inaugural corpus
A Plaintext corpus:

My fellow citizens:

I stand here today humbled by the task before us,
grateful for the trust you have bestowed, mindful
of the sacrifices borne by our ancestors. I thank
President Bush for his service to our nation, as
well as the generosity and cooperation he has
shown throughout this transition.

Forty-four Americans have now taken the
presidential oath. ...............
But we still have lots of HTML...
BeautifulSoup to the rescue



>>> from BeautifulSoup import BeautifulSoup as BS
>>> data = open(filename, 'r').read()
>>> soup = BS(data)
>>> print('\n'.join(soup.findAll(text=True)))
notice the blockquote!
What about blockquotes?

>>> bqs = soup.findAll('blockquote')
>>> [bq.extract() for bq in bqs]
>>> print('\n'.join(soup.findAll(text=True)))

On 05/08/2007, at 12:05 PM, [...] wrote:
If u want it USB bootable, just burn the DSL boot disk to CD and fire it
up.  Then from the desktop after boot, right click and create the
bootable USB key yourself.  I havent actually done this myself (only
seen the option from the menu), but I am assuming it will be a fairly painless
process if you are happy with the stock image.  Would be interested in
how you go as I have to build 50 USB bootable DSL's in the next couple weeks.
Regards,
[...]
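
The same idea (extract all text, but skip anything inside a `blockquote`) can be sketched with only the standard library. This swaps BeautifulSoup for `html.parser` and assumes Python 3; the class and function names are made up, and real archive pages will need more care (scripts, boilerplate, malformed markup).

```python
from html.parser import HTMLParser

class TextWithoutQuotes(HTMLParser):
    """Collect text nodes, ignoring everything nested inside <blockquote>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <blockquote> elements we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextWithoutQuotes()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

page = ("<p>On Monday X wrote:</p>"
        "<blockquote><p>old text</p></blockquote>"
        "<p>My reply.</p>")
print(html_to_text(page))
```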
Step 3: analyse the data
Getting it into NLTK



import nltk
path = 'path/to/files'
corpus = nltk.corpus.PlaintextCorpusReader(path, r'.*\.html')
What about our metadata?
Create a Python dictionary that maps filenames to categories, e.g.

categories = {}
categories['2008-12/msg00226.html'] = [
    'year-2008',
    'month-12',
    'author-BM<bm@xxxxx>',
]
....etc

then...

import nltk
path = 'path/to/files/'
corpus = nltk.corpus.CategorizedPlaintextCorpusReader(
    path, r'.*\.html', cat_map=categories)
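
One way to build that dictionary is to derive the year and month from each file's path and pair them with an author string extracted during cleaning. A sketch under assumptions: the `authors` mapping and `build_cat_map` are hypothetical, and file IDs are assumed to follow the `YYYY-MM/msgNNNNN.html` pattern.

```python
def build_cat_map(authors):
    """Map each file ID to ['year-YYYY', 'month-MM', 'author-NAME'].

    `authors` is an assumed dict of file ID -> author string.
    """
    cat_map = {}
    for fileid, author in authors.items():
        year, month = fileid.split("/")[0].split("-")
        cat_map[fileid] = ["year-" + year, "month-" + month, "author-" + author]
    return cat_map

authors = {"2008-12/msg00226.html": "BM", "2009-04/msg00057.html": "JW"}
categories = build_cat_map(authors)
print(categories["2008-12/msg00226.html"])
```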
Simple categories


cats = corpus.categories()
authorcats=[c for c in cats if c.startswith('author')]
#>>> len(authorcats)
#608
yearcats=[c for c in cats if c.startswith('year')]
monthcats=[c for c in cats if c.startswith('month')]
...who are the top posters?
posts = [(len(corpus.fileids(author)), author) for author in authorcats]
posts.sort(reverse=True)

for count, author in posts[:10]:
    print("%5d\t%s" % (count, author))

→

 1304    author-JW
 1294    author-RC
 1243    author-CS
 1030    author-JH
  868    author-DP
  752    author-TWB
  608    author-CS#2
  556    author-TL
  452    author-BM
  412    author-RM
(email me if you're curious to know if you're on it...)
Frequency distributions
popular =['ubuntu','debian','fedora','arch']
niche = ['gentoo','suse','centos','redhat']

def getcfd(distros,limit):
  cfd = nltk.ConditionalFreqDist(
     (distro, fileid[:limit])
     for fileid in corpus.fileids()
     for w in corpus.words(fileid)
     for distro in distros
     if w.lower().startswith(distro))
  return cfd

popularcfd = getcfd(popular,4) # or 7 for months
popularcfd.plot()
nichecfd = getcfd(niche,4)
nichecfd.plot()
                       another “NLTKism”
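
A `ConditionalFreqDist` is built from `(condition, sample)` pairs and keeps one frequency distribution per condition; in `getcfd` the condition is the distro and the sample is the year or month prefix of the file ID. A toy illustration with made-up pairs, assuming NLTK is installed:

```python
import nltk

# Made-up (distro, year) pairs standing in for the generator inside getcfd()
pairs = [
    ("ubuntu", "2008"), ("ubuntu", "2008"), ("ubuntu", "2009"),
    ("debian", "2008"), ("debian", "2009"), ("debian", "2009"),
]
cfd = nltk.ConditionalFreqDist(pairs)

print(sorted(cfd.conditions()))   # the distros seen
print(cfd["ubuntu"]["2008"])      # how often ubuntu was counted for 2008
print(cfd["debian"].most_common(1))
```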
'Popular' distros by month
'Popular' distros by year
'Niche' distros by year
Random text generation
import random

text = [w.lower() for w in corpus.words()]
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)

def generate_model(cfdist, word, num=15):
    output = []
    for i in range(num):
        output.append(word)
        followers = list(cfdist[word])   # distinct words seen after `word`
        if not followers:                # dead end: nothing ever followed it
            break
        word = random.choice(followers)
    print(' '.join(output))

generate_model(cfd, 'hi', num=20)
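
The generator above picks uniformly among the distinct words that follow the current one. A variation worth trying, sketched here with only the standard library (Python 3.6+ for `random.choices`), weights the choice by how often each bigram occurred. The function names and the toy token list are illustrative, not from the talk.

```python
import random
from collections import Counter, defaultdict

def build_bigram_counts(words):
    """Map each word to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        followers[first][second] += 1
    return followers

def generate(followers, word, num=15, rng=random):
    """Walk the bigram table, choosing successors in proportion to their counts."""
    output = []
    for _ in range(num):
        output.append(word)
        options = followers.get(word)
        if not options:          # dead end: the word was never followed by anything
            break
        candidates = list(options)
        weights = [options[c] for c in candidates]
        word = rng.choices(candidates, weights=weights, k=1)[0]
    return " ".join(output)

tokens = "hi all , hi folks , hi all . thanks all".split()
table = build_bigram_counts(tokens)
print(generate(table, "hi", num=8, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance makes a run repeatable, which helps when comparing authors.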
hi...
hi allan : ages since apparently yum erased . attempts
now venturing into config run ip 10 431 ms 57

hi serg it illegal address entries must *, t close relative info
many families continue fi into modem and reinstalled

hi wen and amended :) imageshack does for grade service
please blame . warning issued an overall environment
consists in

hi folks i accidentally due cause excitingly stupid idiots ,
deletion flag on adding option ? branded ) mounting them

hi guys do composite required </ emulator in for
unattended has info to catalyse a dbus will see atz init3
hi from Peter...
text = [w.lower() for w in corpus.words(categories=
          [c for c in authorcats if 'PeterL' in c])]


hi everyone , hence the database schema and that run on memberdb on mail
store is 12 . yep ,

hi anita , your favourite piece of cpu cycles , he was thinking i hear the middle
of failure .

hi anita , same vhost b internal ip / nine seem odd occasion i hazard . 25ghz
g4 ibook here

hi everyone , same ) on removes a "-- nicelevel nn " as intended . 00 . main
host basis

hi cameron , no biggie . candidates in to upgrade . ubuntu dom0 install if there
! now ). txt

hi cameron , attribution for 30 seconds , and runs out on linux to on www .
luv , these
interesting collocations
                              ...or not
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = [w.lower() for w in corpus.words() if w.isalpha()]
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(text)

finder.apply_freq_filter(3)   # ignore bigrams seen fewer than 3 times
finder.nbest(bigram_measures.pmi, 10)
→
bufnewfile bufread
busmaster speccycle
cellx celly
cheswick bellovin
cread clocal
curtail atl
dmcrs rscem
dmmrbc dmost
dmost dmcrs
...
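
The top pairs look odd because pointwise mutual information rewards pairs whose words almost never appear apart, which favours rare jargon over common phrases. A worked sketch of the calculation in plain Python (the textbook formula, not a copy of NLTK's internals), using a made-up token list:

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Score each adjacent pair with log2(P(x, y) / (P(x) * P(y)))."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), count in bigrams.items():
        # P(x, y) is approximated as count / n, like the unigram probabilities
        scores[(x, y)] = math.log2(count * n / (unigrams[x] * unigrams[y]))
    return scores

tokens = ("the kernel the module the cread clocal flags "
          "the kernel the cread clocal flags the module").split()
scores = pmi_scores(tokens)
for pair in [("cread", "clocal"), ("the", "kernel")]:
    print(pair, round(scores[pair], 2))
```

Both pairs occur twice, but `cread clocal` scores higher because neither word ever appears without the other.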
oblig tag cloud


stopwords = nltk.corpus.stopwords.words('english')
words = [w.lower() for w in corpus.words() if w.isalpha()]
words = [w for w in words if w not in stopwords]
word_fd = nltk.FreqDist(words)
wordmax = word_fd[word_fd.max()]   # count of the most frequent word
wordmin = 1000  # YMMV
taglist = word_fd.items()
# getRanges() and writeCloud() are helper functions not shown on the slides
ranges = getRanges(wordmin, wordmax)
writeCloud(taglist, ranges, 'tags.html')
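
`getRanges` and `writeCloud` are not NLTK functions, and their code is not on the slides. One possible shape for them, purely a guess at the idea (bucket counts into font sizes, then write HTML spans), is sketched below. The signatures match the calls above, but the bodies, the `buckets` parameter and the sample data are assumptions.

```python
import html
import os
import tempfile

def getRanges(wordmin, wordmax, buckets=5):
    """Split [wordmin, wordmax] into equal-width (low, high) count ranges."""
    step = (wordmax - wordmin) / buckets
    return [(wordmin + i * step, wordmin + (i + 1) * step) for i in range(buckets)]

def writeCloud(taglist, ranges, filename):
    """Write words whose count falls in a range, sized by the range's index."""
    spans = []
    for word, count in sorted(taglist):
        for size, (low, high) in enumerate(ranges, start=1):
            if low <= count <= high:
                spans.append('<span style="font-size:%dem">%s</span>'
                             % (size, html.escape(word)))
                break   # words below wordmin match no range and are skipped
    with open(filename, "w", encoding="utf-8") as out:
        out.write("<html><body>\n" + "\n".join(spans) + "\n</body></html>\n")
    return len(spans)

tags = [("linux", 40), ("kernel", 25), ("rare", 2)]
path = os.path.join(tempfile.mkdtemp(), "tags.html")
written = writeCloud(tags, getRanges(10, 40), path)
print(written, "words written to", path)
```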
another one for Peter :)
cats =  [c for c in corpus.categories()
               if 'PeterL' in c]
words=[w.lower() for w in corpus.words(categories=cats)
                         if w.isalpha()]
wordmin = 10
  →
thanks!
for more corpus fun:
http://www.nltk.org/
                             The Book:
       'Natural Language Processing
                         with Python',
                 2nd ed. pub. Jan 2010



      These slides are © Brianna Laugher and are released under
           the Creative Commons Attribution ShareAlike license,
                    v3.0 unported. The data set is not free, sadly...
