Common failures to be avoided when we analyze Wikipedia public data

Felipe Ortega
GsyC/Libresoft

Wikimania 2010, Gdańsk, Poland.
#1 Beware of special types of pages
●   Official page count includes articles with just one link.
    ●   Consider whether you need to filter out
        disambiguation pages.
●   Pay attention to redirects.
    ●   People often wonder why the number of pages in the
        main namespace of the dump is so high.
●   Break down evolution trends by namespace.
    ●   Articles are very different from other pages.
    ●   Explore % of already existing talk pages.
    ●   Connections from user pages.
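
As an illustration of the points on this slide, here is a minimal sketch that counts pages per namespace and separates redirects. It assumes the layout used by recent MediaWiki XML dumps, where each `<page>` carries an `<ns>` element and redirects carry a `<redirect>` element. The sample data and the function name `count_pages` are made up for the example, and the XML namespace that real dumps attach to every tag is omitted.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-made toy sample (assumption): mimics the <page> layout of a dump,
# without the XML namespace that real dumps declare on <mediawiki>.
SAMPLE = b"""<mediawiki>
  <page><title>Gdansk</title><ns>0</ns><revision><text>Article</text></revision></page>
  <page><title>Danzig</title><ns>0</ns><redirect title="Gdansk" /><revision><text>#REDIRECT [[Gdansk]]</text></revision></page>
  <page><title>Talk:Gdansk</title><ns>1</ns><revision><text>Discussion</text></revision></page>
  <page><title>User:Example</title><ns>2</ns><revision><text>Hello</text></revision></page>
</mediawiki>"""


def count_pages(xml_source):
    """Return (pages per namespace, redirects per namespace)."""
    pages, redirects = Counter(), Counter()
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "page":
            continue
        ns = int(elem.findtext("ns"))
        pages[ns] += 1
        if elem.find("redirect") is not None:
            redirects[ns] += 1
        elem.clear()  # free the finished page before reading the next one
    return pages, redirects


pages, redirects = count_pages(io.BytesIO(SAMPLE))
# Main-namespace pages that are not redirects.
real_articles = pages[0] - redirects[0]
print(dict(pages), dict(redirects), real_articles)
```

Disambiguation pages cannot be recognized from these elements alone; filtering them would need an extra rule (for example, a template check) that depends on the language edition.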
#2 Plan your hardware carefully
●   There are some general rules.
    ●   Parallelize as much as possible.
    ●   Buy more memory before buying more disk...
    ●   But take a look at your disk requirements.
        – Requirements change a lot when you can decompress
          the data on the fly instead of storing it decompressed.
    ●   Hardware RAID is not always the best solution.
        –   Software RAID 10 in Linux can perform decently
            for many typical studies.
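
Working on data decompressed on the fly could look like the sketch below, which reads a bzip2 stream line by line so the decompressed text never touches the disk. The toy dump is built in memory for the example; with a real dump you would pass its file path to `bz2.open` instead. The function name `count_page_tags` is hypothetical.

```python
import bz2
import io


def count_page_tags(compressed_source):
    """Count <page> openings while decompressing on the fly."""
    n = 0
    # bz2.open accepts a path or a binary file object; "rt" yields text lines.
    with bz2.open(compressed_source, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if "<page>" in line:
                n += 1
    return n


# Toy stand-in (assumption) for a compressed dump file.
toy_dump = "<mediawiki>\n<page>\n</page>\n<page>\n</page>\n</mediawiki>\n"
compressed = io.BytesIO(bz2.compress(toy_dump.encode("utf-8")))

page_count = count_page_tags(compressed)
print(page_count)
```

Only one line is held in memory at a time, so disk and memory needs stay close to the size of the compressed file.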
#3 Know your engines (DBs)
●   Correct configuration of the DB engine is crucial.
    ●   You'll always fall short with standard configs.
    ●   Fine-tune parameters according to your hardware.
    ●   Exploit memory as much as possible.
        – E.g. MEMORY engine in MySQL.
    ●   Avoid unnecessary backups...
    ●   But be sure that you have copies of relevant info
        elsewhere!
    ●   Think about your process:
        –   Read only vs. read-write.
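
The idea of keeping working tables in memory can be sketched with Python's built-in `sqlite3` module and its `:memory:` database. This is a stand-in chosen so the example runs without a server; it is not the MySQL MEMORY engine named on the slide, and the table layout is an assumption.

```python
import sqlite3

# In-memory database: nothing is written to disk, so lookups are fast,
# but the contents vanish when the connection is closed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page (page_id INTEGER PRIMARY KEY, ns INTEGER, is_redirect INTEGER)"
)

# Toy rows (assumption): two main-namespace pages, one of them a redirect,
# plus one talk page.
rows = [(1, 0, 0), (2, 0, 1), (3, 1, 0)]
conn.executemany("INSERT INTO page VALUES (?, ?, ?)", rows)

articles = conn.execute(
    "SELECT COUNT(*) FROM page WHERE ns = 0 AND is_redirect = 0"
).fetchone()[0]
conn.close()
print(articles)
```

MySQL MEMORY tables likewise lose their rows when the server restarts, which is why the slide insists on keeping copies of relevant data elsewhere.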
#4 Organize your code
●   Using an SCM is a must.
    ●   SVN, Git.
●   Upload your code to a public repository.
    ●   BerliOS, SourceForge, GitHub...
●   Document your code...
    ●   ...if you ever aspire to get interest from other
        developers.
●   Use consistent version numbers.
●   Test, test, test...
    ●   Include sanity checks and “toy tests”.
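
A "toy test" in the sense of this slide might look like the following sketch: a small function checked against a hand-written input whose correct answer is obvious. The function `extract_links` and its regular expression are illustrative assumptions, not code from the talk.

```python
import re

# Matches the target of a [[...]] link, stopping at a label (|),
# a section anchor (#) or the closing brackets.
LINK_RE = re.compile(r"\[\[([^\]|#]+)")


def extract_links(wikitext):
    """Return link targets, dropping labels and section anchors."""
    return [target.strip() for target in LINK_RE.findall(wikitext)]


def test_extract_links_toy():
    text = "See [[Gdansk|the city]] and [[Poland#History]] or [[Wikimania 2010]]."
    assert extract_links(text) == ["Gdansk", "Poland", "Wikimania 2010"]


def test_no_links():
    # Sanity check: plain text must yield nothing.
    assert extract_links("plain text") == []


test_extract_links_toy()
test_no_links()
print("toy tests passed")
```

Tests this small run in milliseconds, so they can be executed before every long job on the full dump.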
#5 Use the right “spell”
●   Target data is well defined:
    ●   XML
    ●   Big portions of plain text
    ●   Inter-wiki links and outlinks.
●   Some alternatives:
    ●   cElementTree (high-speed parsing).
    ●   Python (modules/short scripts) or Java (big
        projects).
    ●   Perl (regexps).
    ●   sed & awk.
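
High-speed streaming parsing in Python can be sketched as below. In current Python versions `xml.etree.ElementTree` uses the C accelerator automatically (the separate `cElementTree` module was removed in Python 3.9), so the example imports the plain module. The sample data, the export-0.10 namespace URI and the helper names are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Toy sample (assumption) using the namespace declared by export schema 0.10.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Gdansk</title><revision><text>Port city in [[Poland]].</text></revision></page>
  <page><title>Poland</title><revision><text>Country in Europe.</text></revision></page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text length) for each page, one page at a time."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, len(text)
        elem.clear()  # release the page once the caller has consumed it


sizes = dict(iter_pages(io.BytesIO(SAMPLE)))
print(sizes)
```

Because each page is cleared after use, memory consumption stays roughly constant regardless of the size of the dump.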
#6 Avoid reinventing the wheel
●   Consider developing your own tool only if:
    ●   No available solution fits your needs.
        – Or you can only find proprietary/evaluation
          software.
    ●   Performance of other solutions is really bad.
●   Example: pywikipediabot
    ●   Simple library to query the Wikipedia API.
    ●   Solves many simple needs of
        researchers/programmers.
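
To give a sense of what such a library wraps, the sketch below builds a MediaWiki API query URL with the standard library and picks redirects out of a response. It does not use pywikipediabot, and it performs no network access: the response is a hand-made stand-in whose shape (a `redirect` key on redirect pages) is an assumption about the default JSON format. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_info_query(titles):
    """Build an action=query URL asking for basic info on several titles."""
    params = {
        "action": "query",
        "prop": "info",
        "titles": "|".join(titles),  # the API takes a pipe-separated list
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def redirect_titles(response_text):
    """Return the sorted titles flagged as redirects in a query response."""
    pages = json.loads(response_text)["query"]["pages"]
    return sorted(p["title"] for p in pages.values() if "redirect" in p)


# Hand-made response (assumption), not real API output.
CANNED = json.dumps({"query": {"pages": {
    "1": {"pageid": 1, "ns": 0, "title": "Gdansk"},
    "2": {"pageid": 2, "ns": 0, "title": "Danzig", "redirect": ""},
}}})

url = build_info_query(["Gdansk", "Danzig"])
found = redirect_titles(CANNED)
print(url)
print(found)
```

An existing library already handles batching, continuation and rate limits on top of this, which is the slide's argument for not writing it yourself.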
#7 Automate everything

●   Huge data repositories.
●   Even small samples are excessively time-consuming
    if processed by hand.
●   You will start to chain individual processes.
●   You will save time in later executions.
●   Your study will be reproducible.
    ●   Updating results after several months
        becomes a no-brainer.
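
Chaining individual processes could be sketched as a list of steps run by a single driver, so that a later re-run is one call. The step functions are hypothetical placeholders with hard-coded data; real steps would download, parse and analyze a dump.

```python
def fetch_titles(state):
    # Hypothetical stand-in for a download/parse step.
    state["titles"] = ["Gdansk", "Danzig", "Talk:Gdansk"]
    return state


def keep_main_namespace(state):
    # Toy rule (assumption): treat any title containing ':' as non-article.
    state["titles"] = [t for t in state["titles"] if ":" not in t]
    return state


def summarize(state):
    state["count"] = len(state["titles"])
    return state


PIPELINE = [fetch_titles, keep_main_namespace, summarize]


def run(pipeline, state=None):
    """Run every step in order and record which steps were executed."""
    state = {} if state is None else state
    log = []
    for step in pipeline:
        state = step(state)
        log.append(step.__name__)
    return state, log


result, log = run(PIPELINE)
print(result["count"], log)
```

Keeping the step list in code also documents the exact order of operations, which is what makes the study reproducible months later.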
#8 Extreme case of Murphy's Law

●   Always expect the worst possible case.
    ●   Many caveats in each implementation.
    ●   Countless particular cases.
    ●   The “average solution” is not good enough.
         – Standard algorithms may take much
           longer than expected to finish the job.
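
One way to plan for particular cases is to write extraction code that tolerates missing or suppressed fields. The sketch below assumes the revision layout of MediaWiki XML dumps, where a contributor may be a registered user, an IP address, or marked `deleted`, and where the text may be empty. The sample data and helper names are made up for the example.

```python
import xml.etree.ElementTree as ET


def contributor_name(revision):
    """Return the username or IP, or None if absent or suppressed."""
    contributor = revision.find("contributor")
    if contributor is None or contributor.get("deleted") is not None:
        return None
    return contributor.findtext("username") or contributor.findtext("ip")


def text_length(revision):
    """Return the text length, treating missing or empty text as 0."""
    return len(revision.findtext("text") or "")


# Toy revisions (assumption) covering the awkward cases.
PAGE = ET.fromstring("""<page>
  <revision><contributor><username>Alice</username></contributor><text>Hello</text></revision>
  <revision><contributor><ip>192.0.2.1</ip></contributor><text /></revision>
  <revision><contributor deleted="deleted" /><text>x</text></revision>
  <revision><text>orphan</text></revision>
</page>""")

summary = [(contributor_name(r), text_length(r)) for r in PAGE.findall("revision")]
print(summary)
```

A version that assumed every revision has a username and non-empty text would crash or miscount partway through a long run.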
#9 Not many graphical interfaces

●   Some good reasons for that:
    ●   Difficult to automate.
    ●   Hard to display dynamic results in real time.
    ●   Almost impossible to compute all results in a
        reasonable time frame for huge data
        collections (e.g. English Wikipedia).
●   To the best of my knowledge, there are very
    few tools with graphical interfaces out there.
●   Is there a real need for them?
#10 Communication channels

●   Wikimedia-research-l
    ●   Mailing list about research on Wikimedia
        projects.
    ●   http://meta.wikimedia.org/wiki/Research
    ●   http://meta.wikimedia.org/wiki/Wikimedia_Research_Network
    ●   http://acawiki.org/Home
●   Final comments
    ●   Need for a consolidated info point, once and for all.
