SlideShare una empresa de Scribd logo
1 de 39
Descargar para leer sin conexión
Python for Data Science
                            PyCon Finland 17.10.2011
                                Harri Hämäläinen
                             harri.hamalainen aalto fi
                                     #sgwwx




Wednesday, October 19, 11
What is Data Science
                    • Data science ~ computer science +
                            mathematics/statistics + visualization
                    • Most web companies are actually doing
                            some kind of Data Science:
                            -   Facebook, Amazon, Google, LinkedIn, Last.fm...

                            -   Social network analysis, recommendations,
                                community building, decision making, analyzing
                                emerging trends, ...


Wednesday, October 19, 11
Scope of the talk
                    • What is in this presentation
                            -   Pythonized tools for retrieving and dealing with
                                data

                            -   Methods, libraries, ethics

                    • What is not included
                            -   Dealing with very large data

                            -   Math or detailed algorithms behind library calls



Wednesday, October 19, 11
Outline

                    •       Harvesting

                    •       Cleaning

                    •
                    •
                            Analyzing
                            Visualizing
                                               Data
                    •       Publishing



Wednesday, October 19, 11
Data harvesting



Wednesday, October 19, 11
Authorative borders
                             for data sources
                    1. Data from your system
                            -   e.g. user access log files, purchase history, view
                                counts

                            -   e.g. sensor readings, manual gathering

                    2. Data from the services of others
                            -   e.g. view counts, tweets, housing prizes, sport
                                results, financing data



Wednesday, October 19, 11
Data sources

                    • Locally available data
                    • Data dumps from Web
                    • Data through Web APIs
                    • Structured data in Web documents

Wednesday, October 19, 11
API

                    • Available for many web applications
                            accessible with general Python libraries
                            -   urllib, soaplib, suds, ...

                    • Some APIs available even as application
                            specific Python libraries
                            -   python-twitter, python-linkedin, pyfacebook, ...



Wednesday, October 19, 11
Data sources

                    • Locally available data
                    • Data dumps in Web
                    • Data through Web APIs
                    • Structured data in Web documents

Wednesday, October 19, 11
Crawling

                    • Crawlers (spiders, web robots) are used to
                            autonomously navigate the Web documents

                               import urllib
                               # queue holds urls, managed by some other component
                               def crawler(queue):
                                   url = queue.get()
                                   fd = urllib.urlopen(url)
                                   content = fd.read()
                                   links = parse_links(content)
                                   # process content
                                   for link in links:
                                       queue.put(link)




Wednesday, October 19, 11
Web Scraping
                    •       Extract information from structured documents in
                            Web
                    •       Multiple libraries for parsing XML documents
                    •       But in general web documents are rarely valid XML
                    •       Some candidates who will stand by you when data
                            contains “dragons”
                            -   BeautifulSoup

                            -   lxml



Wednesday, October 19, 11
urllib & BeautifulSoup
                            >>> import urllib, BeautifulSoup
                            >>> fd = urllib.urlopen('http://mobile.example.org/
                            foo.html')
                            >>> soup = BeautifulSoup.BeautifulSoup(fd.read())
                            >>> print soup.prettify()
                            ...
                            <table>
                                <tr><td>23</td></tr>
                                <tr><td>24</td></tr>
                                <tr><td>25</td></tr>
                                <tr><td>22</td></tr>
                                <tr><td>23</td></tr>
                            </table>
                            ...




                 >>> values = [elem.text for elem in soup.find('table').findChildren('td')]
                 >>>> values
                 [u'23', u'24', u'25', u'22', u'23']




Wednesday, October 19, 11
Scrapy
                    •       A framework for crawling web sites and extracting
                            structured data
                    •       Features
                            -   extracts elements from XML/HTML (with XPath)

                            -   makes it easy to define crawlers with support more specific
                                needs (e.g. HTTP compression headers, robots.txt, crawling
                                depth)

                            -   real-time debugging

                    •       http://scrapy.org


Wednesday, October 19, 11
Tips and ethics
                    •       Use the mobile version of the sites if available
                    •       No cookies
                    •       Respect robots.txt
                    •       Identify yourself
                    •       Use compression (RFC 2616)
                    •       If possible, download bulk data first, process it later
                    •       Prefer dumps over APIs, APIs over scraping
                    •       Be polite and request permission to gather the data
                    •       Worth checking: https://scraperwiki.com/

Wednesday, October 19, 11
Data cleansing
                             (preprocessing)




Wednesday, October 19, 11
Data cleansing
                    • Harvested data may come with lots of
                            noise
                    • ... or interesting anomalies?
                    • Detection
                            -   Scatter plots

                            -   Statistical functions describing distribution



Wednesday, October 19, 11
Data preprocessing

                    • Goal: provide structured presentation for
                            analysis
                            -   Network (graph)

                            -   Values with dimension




Wednesday, October 19, 11
Network representation

                    • Vast number of datasets are describing a
                            network
                            -   Social relations

                            -   Plain old web pages with links

                            -   Anything where some entities in data are related to
                                each other




Wednesday, October 19, 11
Analyzing the Data



Wednesday, October 19, 11
•       Offers efficient          • Builds on top of NumPy
                            multidimensional array
                            object, ndarray          • Modules for
                                                       • statistics, optimization,
                    •       Basic linear algebra             signal processing, ...
                            operations and data
                            types                    • Add-ons (called SciKits)
                                                       for

                    •       Requires GNU Fortran       •     machine learning

                                                       •     data mining

                                                       •     ...



Wednesday, October 19, 11
>>> pylab.scatter(weights, heights)
           >>> xlabel("weight (kg)")
                                                 Curve fitting with numpy polyfit &
           >>> ylabel("height (cm)")
                                                 polyval functions
           >>> pearsonr(weights, heights)
           (0.5028, 0.0)




Wednesday, October 19, 11
NumPy + SciPy +
                            Matplotlib + IPython
                      • Provides Matlab ”-ish” environment
                      • ipython provides extended interactive
                            interpreter (tab completion, magic
                            functions for object querying, debugging, ...)


                            ipython --pylab
                            In [1]: x = linspace(1, 10, 42)
                            In [2]: y = randn(42)
                            In [3]: plot(x, y, 'r--.', markersize=15)




Wednesday, October 19, 11
Analyzing networks


                    • Basicly two bigger alternatives
                            -   NetworkX

                            -   igraph




Wednesday, October 19, 11
NetworkX
                    • Python package for dealing with complex
                            networks:
                            -   Importers / exporters

                            -   Graph generators

                            -   Serialization

                            -   Algorithms

                            -   Visualization



Wednesday, October 19, 11
NetworkX
                            >>> import networkx
                            >>> g = networkx.Graph()
                            >>> data = {'Turku':     [('Helsinki', 165), ('Tampere', 157)],
                                        'Helsinki': [('Kotka', 133), ('Kajaani', 557)],
                                        'Tampere':   [('Jyväskylä', 149)],
                                        'Jyväskylä': [('Kajaani', 307)] }
                            >>> for k,v in data.items():
                            ...     for dest, dist in v:
                            ...         g.add_edge(k, dest, weight=dist)
                            >>> networkx.astar_path_length(g, 'Turku', 'Kajaani')
                            613
                            >>> networkx.astar_path(g, 'Kajaani', 'Turku')
                            ['Kajaani', 'Jyväskylä', 'Tampere', 'Turku']




Wednesday, October 19, 11
Visualizing Data



Wednesday, October 19, 11
NetworkX
                            >>> import networkx
                            >>> g = networkx.Graph()
                            >>> data = {'Turku':     [('Helsinki', 165), ('Tampere', 157)],
                                        'Helsinki': [('Kotka', 133), ('Kajaani', 557)],
                                        'Tampere':   [('Jyväskylä', 149)],
                                        'Jyväskylä': [('Kajaani', 307)] }
                            >>> for k,v in data.items():
                            ...     for dest, dist in v:
                            ...         g.add_edge(k, dest, weight=dist)
                            >>> networkx.astar_path_length(g, 'Turku', 'Kajaani')
                            613
                            >>> networkx.astar_path(g, 'Kajaani', 'Turku')
                            ['Kajaani', 'Jyväskylä', 'Tampere', 'Turku']
                            >>> networkx.draw(g)




Wednesday, October 19, 11
PyGraphviz

                    • Python interface for the Graphviz layout
                            engine
                            -   Graphviz is a collection of graph layout programs




Wednesday, October 19, 11
>>> pylab.scatter(weights, heights)
           >>> xlabel("weight (kg)")
           >>> ylabel("height (cm)")
           >>> pearsonr(weights, heights)
           (0.5028, 0.0)
           >>> title("Scatter plot for weight/height
           statisticsnn=25000, pearsoncor=0.5028")
           >>> savefig("figure.png")




Wednesday, October 19, 11
3D visualizations


                                           with mplot3d toolkit




Wednesday, October 19, 11
Data publishing



Wednesday, October 19, 11
Open Data
                    •       Certain data should be open and therefore available to everyone
                            to use in a way or another
                    •       Some open their data to others hoping it will be beneficial for
                            them or just because there’s no need to hide it
                    •       Examples of open dataset types
                            -   Government data

                            -   Life sciences data

                            -   Culture data

                            -   Commerce data

                            -   Social media data

                            -   Cross-domain data




Wednesday, October 19, 11
The Zen of Open Data
                                                Open is better than closed.
                                            Transparent is better than opaque.
                                               Simple is better than complex.
                                          Accessible is better than inaccessible.
                                              Sharing is better than hoarding.
                                            Linked is more useful than isolated.
                                         Fine grained is preferable to aggregated.

                            Optimize for machine readability — they can translate for humans.
                                                              ...
                            “Flawed, but out there” is a million times better than “perfect, but
                                                      unattainable”.
                                                              ...


                                                  Chris McDowall & co.


Wednesday, October 19, 11
Sharing the Data
                    • Some convenient formats
                            -   JSON (import simplejson)

                            -   XML (import xml)

                            -   RDF (import rdflib, SPARQLWrapper)

                            -   GraphML (import networkx)

                            -   CSV (import csv)



Wednesday, October 19, 11
Resource Description Framework
                             (RDF)
                    • Collection of W3C standards for modeling
                            complex relations and to exchange
                            information
                    • Allows data from multiple sources to
                            combine nicely
                    • RDF describes data with triples
                            -   each triple has form subject - predicate - object e.g.
                                PyconFi2011 is organized in Turku


Wednesday, October 19, 11
RDF Data

     @prefix poi:           <http://schema.onki.fi/poi#> .
     @prefix skos:          <http://www.w3.org/2004/02/skos/core#> .

     ...

     <http://purl.org/finnonto/id/rky/p437>
         a       poi:AreaOfInterest ;
         poi:description """Turun rautatieasema on maailmansotien välisen ajan merkittävimpiä asemarakennushankkeita Suomessa..."""@fi ;
         poi:hasPolygon "60.454833421,22.253543828 60.453846032,22.254787430 60.453815665,22.254725349..." ;
         poi:history """Turkuun suunniteltiin rautatietä 1860-luvulta lähtien, mutta ensimmäinen rautatieyhteys Turkuun..."""@fi ;
         poi:municipality kunnat:k853 ;
         poi:poiType poio:tuotantorakennus , poio:asuinrakennus , poio:puisto ;
         poi:webPage "http://www.rky.fi/read/asp/r_kohde_det.aspx?KOHDE_ID=1865" ;
         skos:prefLabel "Turun rautatieympäristöt"@fi .

     ...




Wednesday, October 19, 11
from SPARQLWrapper import SPARQLWrapper, JSON
    QUERY = """
       Prefix lgd:<http://linkedgeodata.org/>
       Prefix lgdo:<http://linkedgeodata.org/ontology/>
       Prefix rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>

          Select distinct ?label ?g From <http://linkedgeodata.org> {
             ?s rdf:type <http://linkedgeodata.org/ontology/Library> .
             ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label.
             ?s geo:geometry ?g .
             Filter(bif:st_intersects (?g, bif:st_point (24.9375, 60.170833), 1)) .
          }"""

    sparql = SPARQLWrapper("http://linkedgeodata.org/sparql")
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()['results']['bindings']
    for result in results:
        print result['label']['value'].encode('utf-8'), result['g']['value']

                                        German library POINT(24.9495 60.1657)
        Query for libraries in          Metsätalo POINT(24.9497 60.1729)
                                        Opiskelijakirjasto POINT(24.9489 60.1715)
       Helsinki located within          Topelia POINT(24.9493 60.1713)
       1 kilometer radius from          Eduskunnan kirjasto POINT(24.9316 60.1725)
            the city centre             Rikhardinkadun kirjasto POINT(24.9467 60.1662)
                                        Helsinki 10 POINT(24.9386 60.1713)


Wednesday, October 19, 11
RDF Data sources
                    •       CKAN

                    •       DBpedia

                    •       LinkedGeoData

                    •       DBTune

                    •       http://semantic.hri.fi/




Wednesday, October 19, 11
harri.hamalainen aalto fi
Wednesday, October 19, 11

Más contenido relacionado

La actualidad más candente

Introduction to pandas
Introduction to pandasIntroduction to pandas
Introduction to pandasPiyush rai
 
Machine Learning Using Python
Machine Learning Using PythonMachine Learning Using Python
Machine Learning Using PythonSavitaHanchinal
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and RegressionMegha Sharma
 
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...Edureka!
 
Machine Learning and its Applications
Machine Learning and its ApplicationsMachine Learning and its Applications
Machine Learning and its ApplicationsDr Ganesh Iyer
 
Introduction to python programming
Introduction to python programmingIntroduction to python programming
Introduction to python programmingSrinivas Narasegouda
 
Python and its Applications
Python and its ApplicationsPython and its Applications
Python and its ApplicationsAbhijeet Singh
 
Linear Regression Analysis | Linear Regression in Python | Machine Learning A...
Linear Regression Analysis | Linear Regression in Python | Machine Learning A...Linear Regression Analysis | Linear Regression in Python | Machine Learning A...
Linear Regression Analysis | Linear Regression in Python | Machine Learning A...Simplilearn
 
Machine Learning With Python | Machine Learning Algorithms | Machine Learning...
Machine Learning With Python | Machine Learning Algorithms | Machine Learning...Machine Learning With Python | Machine Learning Algorithms | Machine Learning...
Machine Learning With Python | Machine Learning Algorithms | Machine Learning...Simplilearn
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data ScienceKenny Daniel
 
Conceptual dependency
Conceptual dependencyConceptual dependency
Conceptual dependencyJismy .K.Jose
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data sciencebhavesh lande
 
Introduction to-python
Introduction to-pythonIntroduction to-python
Introduction to-pythonAakashdata
 

La actualidad más candente (20)

Introduction to pandas
Introduction to pandasIntroduction to pandas
Introduction to pandas
 
Machine Learning Using Python
Machine Learning Using PythonMachine Learning Using Python
Machine Learning Using Python
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and Regression
 
Statistics for data science
Statistics for data science Statistics for data science
Statistics for data science
 
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
 
Machine Learning and its Applications
Machine Learning and its ApplicationsMachine Learning and its Applications
Machine Learning and its Applications
 
Support Vector Machines ( SVM )
Support Vector Machines ( SVM ) Support Vector Machines ( SVM )
Support Vector Machines ( SVM )
 
Machine learning
Machine learningMachine learning
Machine learning
 
Introduction to python programming
Introduction to python programmingIntroduction to python programming
Introduction to python programming
 
supervised learning
supervised learningsupervised learning
supervised learning
 
Python and its Applications
Python and its ApplicationsPython and its Applications
Python and its Applications
 
Linear Regression Analysis | Linear Regression in Python | Machine Learning A...
Linear Regression Analysis | Linear Regression in Python | Machine Learning A...Linear Regression Analysis | Linear Regression in Python | Machine Learning A...
Linear Regression Analysis | Linear Regression in Python | Machine Learning A...
 
Machine Learning With Python | Machine Learning Algorithms | Machine Learning...
Machine Learning With Python | Machine Learning Algorithms | Machine Learning...Machine Learning With Python | Machine Learning Algorithms | Machine Learning...
Machine Learning With Python | Machine Learning Algorithms | Machine Learning...
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data Science
 
Conceptual dependency
Conceptual dependencyConceptual dependency
Conceptual dependency
 
Python programming : Classes objects
Python programming : Classes objectsPython programming : Classes objects
Python programming : Classes objects
 
Python Basics
Python BasicsPython Basics
Python Basics
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data science
 
Recognition-of-tokens
Recognition-of-tokensRecognition-of-tokens
Recognition-of-tokens
 
Introduction to-python
Introduction to-pythonIntroduction to-python
Introduction to-python
 

Similar a Python for Data Science

Spotify: Playing for millions, tuning for more
Spotify: Playing for millions, tuning for moreSpotify: Playing for millions, tuning for more
Spotify: Playing for millions, tuning for moreNick Barkas
 
Web search engines and search technology
Web search engines and search technologyWeb search engines and search technology
Web search engines and search technologyStefanos Anastasiadis
 
Building A Scalable Open Source Storage Solution
Building A Scalable Open Source Storage SolutionBuilding A Scalable Open Source Storage Solution
Building A Scalable Open Source Storage SolutionPhil Cryer
 
Search all the things
Search all the thingsSearch all the things
Search all the thingscyberswat
 
Sharing information with MediaWiki
Sharing information with MediaWikiSharing information with MediaWiki
Sharing information with MediaWikiGeert Van Pamel
 
Scraping talk public
Scraping talk publicScraping talk public
Scraping talk publicNesta
 
Gaelyk - SpringOne2GX - 2010 - Guillaume Laforge
Gaelyk - SpringOne2GX - 2010 - Guillaume LaforgeGaelyk - SpringOne2GX - 2010 - Guillaume Laforge
Gaelyk - SpringOne2GX - 2010 - Guillaume LaforgeGuillaume Laforge
 
An Open Platform DM Solution for New Account Intake
An Open Platform DM Solution for New Account IntakeAn Open Platform DM Solution for New Account Intake
An Open Platform DM Solution for New Account IntakeAlfresco Software
 
The original vision of Nutch, 14 years later: Building an open source search ...
The original vision of Nutch, 14 years later: Building an open source search ...The original vision of Nutch, 14 years later: Building an open source search ...
The original vision of Nutch, 14 years later: Building an open source search ...Sylvain Zimmer
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...benaam
 
Semantic Search overview at SSSW 2012
Semantic Search overview at SSSW 2012Semantic Search overview at SSSW 2012
Semantic Search overview at SSSW 2012Peter Mika
 
LTR Handout
LTR HandoutLTR Handout
LTR Handoutkoegeljm
 
Nuxeo World Session: Migrating to Nuxeo
Nuxeo World Session: Migrating to NuxeoNuxeo World Session: Migrating to Nuxeo
Nuxeo World Session: Migrating to NuxeoNuxeo
 
Instrumentation with Splunk
Instrumentation with SplunkInstrumentation with Splunk
Instrumentation with SplunkDatavail
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Srinath Perera
 
Workshop: Big Data Visualization for Security
Workshop: Big Data Visualization for SecurityWorkshop: Big Data Visualization for Security
Workshop: Big Data Visualization for SecurityRaffael Marty
 
Optiq: A dynamic data management framework
Optiq: A dynamic data management frameworkOptiq: A dynamic data management framework
Optiq: A dynamic data management frameworkJulian Hyde
 

Similar a Python for Data Science (20)

Spotify: Playing for millions, tuning for more
Spotify: Playing for millions, tuning for moreSpotify: Playing for millions, tuning for more
Spotify: Playing for millions, tuning for more
 
Web search engines and search technology
Web search engines and search technologyWeb search engines and search technology
Web search engines and search technology
 
Building A Scalable Open Source Storage Solution
Building A Scalable Open Source Storage SolutionBuilding A Scalable Open Source Storage Solution
Building A Scalable Open Source Storage Solution
 
Search all the things
Search all the thingsSearch all the things
Search all the things
 
Sharing information with MediaWiki
Sharing information with MediaWikiSharing information with MediaWiki
Sharing information with MediaWiki
 
Scraping talk public
Scraping talk publicScraping talk public
Scraping talk public
 
Gaelyk - SpringOne2GX - 2010 - Guillaume Laforge
Gaelyk - SpringOne2GX - 2010 - Guillaume LaforgeGaelyk - SpringOne2GX - 2010 - Guillaume Laforge
Gaelyk - SpringOne2GX - 2010 - Guillaume Laforge
 
An Open Platform DM Solution for New Account Intake
An Open Platform DM Solution for New Account IntakeAn Open Platform DM Solution for New Account Intake
An Open Platform DM Solution for New Account Intake
 
The original vision of Nutch, 14 years later: Building an open source search ...
The original vision of Nutch, 14 years later: Building an open source search ...The original vision of Nutch, 14 years later: Building an open source search ...
The original vision of Nutch, 14 years later: Building an open source search ...
 
Google Dorks
Google DorksGoogle Dorks
Google Dorks
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...
 
Semantic Search overview at SSSW 2012
Semantic Search overview at SSSW 2012Semantic Search overview at SSSW 2012
Semantic Search overview at SSSW 2012
 
Webofdata
WebofdataWebofdata
Webofdata
 
LTR Handout
LTR HandoutLTR Handout
LTR Handout
 
Nuxeo World Session: Migrating to Nuxeo
Nuxeo World Session: Migrating to NuxeoNuxeo World Session: Migrating to Nuxeo
Nuxeo World Session: Migrating to Nuxeo
 
Instrumentation with Splunk
Instrumentation with SplunkInstrumentation with Splunk
Instrumentation with Splunk
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
 
Workshop: Big Data Visualization for Security
Workshop: Big Data Visualization for SecurityWorkshop: Big Data Visualization for Security
Workshop: Big Data Visualization for Security
 
Optiq: A dynamic data management framework
Optiq: A dynamic data management frameworkOptiq: A dynamic data management framework
Optiq: A dynamic data management framework
 

Último

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 

Último (20)

Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Python for Data Science

  • 1. Python for Data Science PyCon Finland 17.10.2011 Harri Hämäläinen harri.hamalainen aalto fi #sgwwx Wednesday, October 19, 11
  • 2. What is Data Science • Data science ~ computer science + mathematics/statistics + visualization • Most web companies are actually doing some kind of Data Science: - Facebook, Amazon, Google, LinkedIn, Last.fm... - Social network analysis, recommendations, community building, decision making, analyzing emerging trends, ... Wednesday, October 19, 11
  • 3. Scope of the talk • What is in this presentation - Pythonized tools for retrieving and dealing with data - Methods, libraries, ethics • What is not included - Dealing with very large data - Math or detailed algorithms behind library calls Wednesday, October 19, 11
  • 4. Outline • Harvesting • Cleaning • • Analyzing Visualizing Data • Publishing Wednesday, October 19, 11
  • 6. Authorative borders for data sources 1. Data from your system - e.g. user access log files, purchase history, view counts - e.g. sensor readings, manual gathering 2. Data from the services of others - e.g. view counts, tweets, housing prizes, sport results, financing data Wednesday, October 19, 11
  • 7. Data sources • Locally available data • Data dumps from Web • Data through Web APIs • Structured data in Web documents Wednesday, October 19, 11
  • 8. API • Available for many web applications accessible with general Python libraries - urllib, soaplib, suds, ... • Some APIs available even as application specific Python libraries - python-twitter, python-linkedin, pyfacebook, ... Wednesday, October 19, 11
  • 9. Data sources • Locally available data • Data dumps in Web • Data through Web APIs • Structured data in Web documents Wednesday, October 19, 11
  • 10. Crawling • Crawlers (spiders, web robots) are used to autonomously navigate the Web documents import urllib # queue holds urls, managed by some other component def crawler(queue): url = queue.get() fd = urllib.urlopen(url) content = fd.read() links = parse_links(content) # process content for link in links: queue.put(link) Wednesday, October 19, 11
  • 11. Web Scraping • Extract information from structured documents in Web • Multiple libraries for parsing XML documents • But in general web documents are rarely valid XML • Some candidates who will stand by you when data contains “dragons” - BeautifulSoup - lxml Wednesday, October 19, 11
  • 12. urllib & BeautifulSoup >>> import urllib, BeautifulSoup >>> fd = urllib.urlopen('http://mobile.example.org/ foo.html') >>> soup = BeautifulSoup.BeautifulSoup(fd.read()) >>> print soup.prettify() ... <table> <tr><td>23</td></tr> <tr><td>24</td></tr> <tr><td>25</td></tr> <tr><td>22</td></tr> <tr><td>23</td></tr> </table> ... >>> values = [elem.text for elem in soup.find('table').findChildren('td')] >>>> values [u'23', u'24', u'25', u'22', u'23'] Wednesday, October 19, 11
  • 13. Scrapy • A framework for crawling web sites and extracting structured data • Features - extracts elements from XML/HTML (with XPath) - makes it easy to define crawlers with support more specific needs (e.g. HTTP compression headers, robots.txt, crawling depth) - real-time debugging • http://scrapy.org Wednesday, October 19, 11
  • 14. Tips and ethics • Use the mobile version of the sites if available • No cookies • Respect robots.txt • Identify yourself • Use compression (RFC 2616) • If possible, download bulk data first, process it later • Prefer dumps over APIs, APIs over scraping • Be polite and request permission to gather the data • Worth checking: https://scraperwiki.com/ Wednesday, October 19, 11
  • 15. Data cleansing (preprocessing) Wednesday, October 19, 11
  • 16. Data cleansing • Harvested data may come with lots of noise • ... or interesting anomalies? • Detection - Scatter plots - Statistical functions describing distribution Wednesday, October 19, 11
  • 17. Data preprocessing • Goal: provide structured presentation for analysis - Network (graph) - Values with dimension Wednesday, October 19, 11
  • 18. Network representation • Vast number of datasets are describing a network - Social relations - Plain old web pages with links - Anything where some entities in data are related to each other Wednesday, October 19, 11
  • 20. Offers efficient • Builds on top of NumPy multidimensional array object, ndarray • Modules for • statistics, optimization, • Basic linear algebra signal processing, ... operations and data types • Add-ons (called SciKits) for • Requires GNU Fortran • machine learning • data mining • ... Wednesday, October 19, 11
  • 21. >>> pylab.scatter(weights, heights) >>> xlabel("weight (kg)") Curve fitting with numpy polyfit & >>> ylabel("height (cm)") polyval functions >>> pearsonr(weights, heights) (0.5028, 0.0) Wednesday, October 19, 11
  • 22. NumPy + SciPy + Matplotlib + IPython • Provides Matlab ”-ish” environment • ipython provides extended interactive interpreter (tab completion, magic functions for object querying, debugging, ...) ipython --pylab In [1]: x = linspace(1, 10, 42) In [2]: y = randn(42) In [3]: plot(x, y, 'r--.', markersize=15) Wednesday, October 19, 11
  • 23. Analyzing networks • Basicly two bigger alternatives - NetworkX - igraph Wednesday, October 19, 11
  • 24. NetworkX • Python package for dealing with complex networks: - Importers / exporters - Graph generators - Serialization - Algorithms - Visualization Wednesday, October 19, 11
  • 25. NetworkX >>> import networkx >>> g = networkx.Graph() >>> data = {'Turku': [('Helsinki', 165), ('Tampere', 157)], 'Helsinki': [('Kotka', 133), ('Kajaani', 557)], 'Tampere': [('Jyväskylä', 149)], 'Jyväskylä': [('Kajaani', 307)] } >>> for k,v in data.items(): ... for dest, dist in v: ... g.add_edge(k, dest, weight=dist) >>> networkx.astar_path_length(g, 'Turku', 'Kajaani') 613 >>> networkx.astar_path(g, 'Kajaani', 'Turku') ['Kajaani', 'Jyväskylä', 'Tampere', 'Turku'] Wednesday, October 19, 11
  • 27. NetworkX >>> import networkx >>> g = networkx.Graph() >>> data = {'Turku': [('Helsinki', 165), ('Tampere', 157)], 'Helsinki': [('Kotka', 133), ('Kajaani', 557)], 'Tampere': [('Jyväskylä', 149)], 'Jyväskylä': [('Kajaani', 307)] } >>> for k,v in data.items(): ... for dest, dist in v: ... g.add_edge(k, dest, weight=dist) >>> networkx.astar_path_length(g, 'Turku', 'Kajaani') 613 >>> networkx.astar_path(g, 'Kajaani', 'Turku') ['Kajaani', 'Jyväskylä', 'Tampere', 'Turku'] >>> networkx.draw(g) Wednesday, October 19, 11
  • 28. PyGraphviz • Python interface for the Graphviz layout engine - Graphviz is a collection of graph layout programs Wednesday, October 19, 11
  • 29. >>> pylab.scatter(weights, heights) >>> xlabel("weight (kg)") >>> ylabel("height (cm)") >>> pearsonr(weights, heights) (0.5028, 0.0) >>> title("Scatter plot for weight/height statisticsnn=25000, pearsoncor=0.5028") >>> savefig("figure.png") Wednesday, October 19, 11
  • 30. 3D visualizations with mplot3d toolkit Wednesday, October 19, 11
  • 32. Open Data • Certain data should be open and therefore available to everyone to use in a way or another • Some open their data to others hoping it will be beneficial for them or just because there’s no need to hide it • Examples of open dataset types - Government data - Life sciences data - Culture data - Commerce data - Social media data - Cross-domain data Wednesday, October 19, 11
  • 33. The Zen of Open Data Open is better than closed. Transparent is better than opaque. Simple is better than complex. Accessible is better than inaccessible. Sharing is better than hoarding. Linked is more useful than isolated. Fine grained is preferable to aggregated. Optimize for machine readability — they can translate for humans. ... “Flawed, but out there” is a million times better than “perfect, but unattainable”. ... Chris McDowall & co. Wednesday, October 19, 11
  • 34. Sharing the Data • Some convenient formats - JSON (import simplejson) - XML (import xml) - RDF (import rdflib, SPARQLWrapper) - GraphML (import networkx) - CSV (import csv) Wednesday, October 19, 11
  • 35. Resource Description Framework (RDF) • Collection of W3C standards for modeling complex relations and to exchange information • Allows data from multiple sources to combine nicely • RDF describes data with triples - each triple has form subject - predicate - object e.g. PyconFi2011 is organized in Turku Wednesday, October 19, 11
  • 36. RDF Data @prefix poi: <http://schema.onki.fi/poi#> . @prefix skos: <http://www.w3.org/2004/02/skos/core#> . ... <http://purl.org/finnonto/id/rky/p437> a poi:AreaOfInterest ; poi:description """Turun rautatieasema on maailmansotien välisen ajan merkittävimpiä asemarakennushankkeita Suomessa..."""@fi ; poi:hasPolygon "60.454833421,22.253543828 60.453846032,22.254787430 60.453815665,22.254725349..." ; poi:history """Turkuun suunniteltiin rautatietä 1860-luvulta lähtien, mutta ensimmäinen rautatieyhteys Turkuun..."""@fi ; poi:municipality kunnat:k853 ; poi:poiType poio:tuotantorakennus , poio:asuinrakennus , poio:puisto ; poi:webPage "http://www.rky.fi/read/asp/r_kohde_det.aspx?KOHDE_ID=1865" ; skos:prefLabel "Turun rautatieympäristöt"@fi . ... Wednesday, October 19, 11
  • 37. from SPARQLWrapper import SPARQLWrapper, JSON QUERY = """ Prefix lgd:<http://linkedgeodata.org/> Prefix lgdo:<http://linkedgeodata.org/ontology/> Prefix rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> Select distinct ?label ?g From <http://linkedgeodata.org> { ?s rdf:type <http://linkedgeodata.org/ontology/Library> . ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label. ?s geo:geometry ?g . Filter(bif:st_intersects (?g, bif:st_point (24.9375, 60.170833), 1)) . }""" sparql = SPARQLWrapper("http://linkedgeodata.org/sparql") sparql.setQuery(QUERY) sparql.setReturnFormat(JSON) results = sparql.query().convert()['results']['bindings'] for result in results: print result['label']['value'].encode('utf-8'), result['g']['value'] German library POINT(24.9495 60.1657) Query for libraries in Metsätalo POINT(24.9497 60.1729) Opiskelijakirjasto POINT(24.9489 60.1715) Helsinki located within Topelia POINT(24.9493 60.1713) 1 kilometer radius from Eduskunnan kirjasto POINT(24.9316 60.1725) the city centre Rikhardinkadun kirjasto POINT(24.9467 60.1662) Helsinki 10 POINT(24.9386 60.1713) Wednesday, October 19, 11
  • 38. RDF Data sources • CKAN • DBpedia • LinkedGeoData • DBTune • http://semantic.hri.fi/ Wednesday, October 19, 11