indeed scrape
November 10, 2015
1 Applying Data Science to DS Job Hunting
1.1 Indeed API
Indeed.com offers a publisher’s API for adding links in a web page or app. I decided to use this API to
gather a sample of job postings from which to scrape a list of skills.
The API returns at most 25 URLs per query, so one needs a trick to get a significant amount of data.
The trick I’m using now is to query by zip code. There are ~43K zip codes in the US, so that should bring
us some hits. For now, I’m using 500 randomly selected samples of the ~43K zip codes, each returning from 0 to
25 URLs.
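The sampling step can be sketched as follows. This is a minimal illustration, assuming the zip codes are already available as a list of strings; the function name `sample_zipcodes` and its parameters are hypothetical and are not part of `indeed_scrape`, and the API call itself is left out.

```python
import random

def sample_zipcodes(zipcodes, num_samp, prefix="", seed=None):
    """Return up to num_samp randomly chosen zip codes that start with prefix."""
    rng = random.Random(seed)  # seeded generator so a run can be repeated
    pool = [z for z in zipcodes if z.startswith(prefix)]
    # Never ask for more samples than the filtered pool contains.
    return rng.sample(pool, min(num_samp, len(pool)))

# Toy input; the real list would hold the ~43K US zip codes.
zipcodes = ["90210", "94103", "10001", "98101", "60601"]
picked = sample_zipcodes(zipcodes, num_samp=2, prefix="9", seed=0)
print(picked)
```

Each sampled zip code would then be sent as the location of one query, so the 25-URL cap applies per zip code rather than to the whole search.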
1.2 Parsing Out Skills
To parse out what I think are the skills, I use BeautifulSoup to iterate over the sections locating the bulleted
points:
SQL
Python
AWS
Visual inspection indicates that most of the time, an employer will use a list to itemize the position skills.
It would be cool to run a second supporting project that tries to verify this: how many job postings contain
any itemized lists versus those that do not?
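A minimal sketch of the list-item extraction, assuming each posting is available as an HTML string and that skills sit in `<li>` elements. The helper names `extract_list_items` and `has_itemized_list` are hypothetical; the actual parsing inside `indeed_scrape` may differ.

```python
from bs4 import BeautifulSoup

def extract_list_items(html):
    """Return the text of every <li> element found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")  # parser from the standard library
    return [li.get_text(" ", strip=True) for li in soup.find_all("li")]

def has_itemized_list(html):
    """True if the posting contains at least one bulleted item."""
    return len(extract_list_items(html)) > 0

# Toy posting, made up for illustration.
html = "<p>Requirements</p><ul><li>SQL</li><li>Python</li><li>AWS</li></ul>"
print(extract_list_items(html))
```

Running `has_itemized_list` over every downloaded posting and counting the `True` values would give a first answer to the question of how many postings use itemized lists.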
1.2.1 Stop Words
I wanted a way to add new stop words. The word “data” obviously shows up many times and is not helpful.
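One way to extend a stop-word set is sketched below. The base set and the helper names are assumptions made for illustration; this is not the code behind `ind.add_stop_words()`.

```python
def add_stop_words(current, new_words):
    """Return a new set with new_words added; accepts a string or an iterable."""
    if isinstance(new_words, str):
        new_words = new_words.split()  # "data science" -> ["data", "science"]
    return set(current) | {w.lower() for w in new_words}

def remove_stop_words(tokens, stop_words):
    """Drop every token whose lowercase form is a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

# Tiny base set, for illustration only.
stop_words = add_stop_words({"the", "and", "of"}, "data")
tokens = "Analysis of data with Python and SQL".split()
print(remove_stop_words(tokens, stop_words))
```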
1.3 Begin Analysis
1.3.1 Bar Plot
To count up the parsed skill tokens, I employ scikit-learn’s CountVectorizer and produce a simple bar plot
of the counts.
1.3.2 Locations
For this example, I’m using all the zipcodes that start with ‘9’ and 100 randomly selected samples.
In [7]: import indeed_scrape
        import matplotlib.pyplot as plt
        from matplotlib import rcParams
        rcParams['figure.figsize'] = (15, 8)
        %matplotlib inline
        ind = indeed_scrape.Indeed()
        ind.query = "data science"
        ind.add_loc = '9' # will add regex-ed zip codes
        ind.num_samp = 100
        ind.stop_words = "data"
In [8]: plt.figure(figsize=(15, 8))
        ind.main()
indeed_scrape.Indeed() saves its output to a file.
In [9]: import pandas as pd
        df = pd.read_csv("data_frame.csv")
        corpus = df['summary']
Take a look at how many job postings were returned.
In [10]: df = df.drop_duplicates().dropna()
         df['url'].count()
Out[10]: 1451
1.4 Monogram
Above, a bigram analysis was performed by default. Let’s include single words by setting the n-gram range to
(1, 2), and use a corpus that has been stemmed with NLTK.
In [11]: corpus_stem = df['summary_toke']
         # pass the stemmed corpus, as described above
         mat, fea = ind.vectorizer(corpus_stem, n_min=1)
         plt.figure(figsize=(15,8))
         ind.plot_features(fea, mat)
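To make the n-gram range concrete, here is a small pure-Python sketch of which tokens a range of (1, 2) produces. The function `ngrams` is hypothetical and only illustrates the idea; CountVectorizer does this internally.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of the token list for n_min <= n <= n_max."""
    out = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

tokens = ["machine", "learning", "experience"]
print(ngrams(tokens))
```

With `n_min=1` the single words are counted alongside the two-word phrases, which is why common words such as "experience" start to dominate the plot.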
1.5 Explore High Count Words
The word “experience” showed up with a high count. I want to know if there’s more to that: is it experience
with a platform, a technology, SQL, or just previous analytic experience? NLP is a deep rabbit hole, and I only
peered a short way down for this project.
My word radius method gathers words to the left and right of a chosen keyword, and builds a corpus
from within that radius. Then I apply the CountVectorizer again.
You’ll notice that I still need to write code to remove the searched keyword from the analysis.
Next iteration...
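The word radius idea can be sketched as follows. This is an assumption about how such a method could work, not the implementation of `ind.find_words_in_radius`; the name `words_near_keyword`, the tokenizing regex, and the `drop_keyword` flag (which covers the keyword removal mentioned above) are all hypothetical.

```python
import re

def words_near_keyword(corpus, keyword, radius, drop_keyword=True):
    """For each occurrence of keyword, collect up to `radius` words on each side."""
    keyword = keyword.lower()
    windows = []
    for doc in corpus:
        # Simple tokenizer: lowercase runs of letters, digits, '+' and '#'.
        tokens = re.findall(r"[a-z0-9+#]+", doc.lower())
        for i, tok in enumerate(tokens):
            if tok != keyword:
                continue
            window = tokens[max(0, i - radius): i + radius + 1]
            if drop_keyword:
                window = [w for w in window if w != keyword]
            windows.append(" ".join(window))
    return windows

# Toy posting text, made up for illustration.
corpus = ["Five years of experience with SQL and Python required"]
print(words_near_keyword(corpus, "experience", 2))
```

Each returned string is one small document, so the list can be passed straight back into the vectorizer.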
1.5.1 Experience
In [20]: plt.figure(figsize=(10,5))
         # adjust stop words
         ind.stop_words = "experience"
         ind.add_stop_words()
         words_in_radius = ind.find_words_in_radius(corpus, 'experience', 5)
         mat, fea = ind.vectorizer(words_in_radius, max_features=30, n_min=1)
         ind.plot_features(fea, mat)
1.5.2 Skills
In [21]: plt.figure(figsize=(10,5))
         # adjust stop words
         ind.stop_words = "skills"
         ind.add_stop_words()
         words_in_radius = ind.find_words_in_radius(corpus, 'skills', 5)
         mat, fea = ind.vectorizer(words_in_radius, max_features=30, n_min=1)
         ind.plot_features(fea, mat)
1.6 Job Postings Per City
In [14]: grp = df.groupby('city')
         # plot the 20 cities with the most postings
         grp['url'].count().sort_values()[-20:].plot(kind='bar', alpha=0.5, figsize=(14,8), grid=True)
Out[14]: <matplotlib.axes._subplots.AxesSubplot at 0x7f9a74c3e290>

Analysis Of Open Positions In Data Science
