4. Crawlers: AKA spiders, bots, scrapers, data mining
80legs can crawl over 5,000,000 web pages in 1 hour
Yahoo BOSS
http://www.ibm.com/developerworks/linux/library/l-spider/?ca=dgr-lnxw01WebSpiderLinux
ScraperWiki
But Yahoo already has!!!
Python crawlers
• Mechanize
• Harvestman
• Scrapy
• Spynner
99 on Google Code!
Extractiv
JISC funded project – The call was about COLLABORATION. Small project £16,000? Worked on it when there were “quiet times” on the Collab Tools project.
The idea was to create an Amazon-like tool – “If you like this person – you may want to know these people” – drop that information into a social network ( something people used regularly )
All the bits are “out there” already – it should just be a matter of assembling them. A CRAWLER is some code that goes and gets web pages. You give a SEMANTIC ENGINE messy data, like web pages, and it gives you CONCEPTS and meaning. The VISUALISATIONS are rife, let’s find an appropriate one and use it. Make the data EDITABLE by the people.
In a way there are lots of CRAWLING TECHNOLOGIES to choose from. 80legs is a service ( as is YAHOO BOSS ). You say, start with this URL and these regular expressions, call me when you have a spreadsheet. Yahoo have already crawled your website, and used XPATH to fish data out. Well-proven tools like HT DIG and CURL ( quite hard to use, not quite what I wanted ). There are open source crawlers ( most of them are RUBBISH! ). Used Harvestman. Indian developer. I crawled YORK, LEEDS and SHEFFIELD web sites ( AND the white rose consortium repository ). Took a few days each.
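The "start with this URL and these regular expressions" idea can be sketched in a few lines of standard-library Python. This is NOT how Harvestman or 80legs work internally, just a minimal illustration. The names crawl, fetch and LinkParser are made up for this sketch, and there is no politeness delay or robots.txt handling.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page as text (a real network call)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, allow_pattern, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl from start_url, queueing only URLs that match allow_pattern."""
    allow = re.compile(allow_pattern)
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments
            link = urljoin(url, href).split("#")[0]
            if link not in seen and allow.search(link):
                seen.add(link)
                queue.append(link)
    return pages
```

Something like crawl("http://www.york.ac.uk/", r"^http://www\.york\.ac\.uk/") would stay on one site. The fetch_page argument is there so the crawler can be tried against a fake in-memory site first.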
You have to store the data somewhere. MySQL is an obvious choice – but all the cool kids are using NOSQL databases. HOW YOU STORE DATA IS HOW YOU THINK ( ABOUT DATA ). Schema-less. Graph databases ( to me ) seemed even cooler because you can do queries where you discover things. For example: all the people who know 5 or more people who have been to the same country for an event linked to Biology.
Once you have your data, you then want to find out about it. I looked at NODEBOX which is a collection of python libraries that let you SUMMARIZE data, get EMOTIONAL rankings and SYNONYMS and it does VISUALISATION ( see background ). Too complicated… too much data. There are services like Textwise, OpenAmplify and Calais. You say, here’s a web page, they say CABBAGES, FRANCE and TOMMY COOPER. OpenCalais – Thomson Reuters. Django module. GOOD ENOUGH.
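To make the "discover things" query concrete, here is a rough sketch with plain Python dicts standing in for the graph. It is not Neo4j code; the data layout ( knows, attended, events ) and the function name well_connected are assumptions invented for the example.

```python
from collections import defaultdict


def well_connected(knows, attended, events, concept, min_contacts=5):
    """Find people who know at least min_contacts people who went to the
    same country for an event linked to `concept`.

    knows:    person -> set of people they know
    attended: person -> list of event ids
    events:   event id -> {"country": str, "concepts": set of str}
    Returns person -> set of countries where the condition holds.
    """
    # country -> everyone who attended a matching event there
    travellers = defaultdict(set)
    for person, event_ids in attended.items():
        for event_id in event_ids:
            event = events[event_id]
            if concept in event["concepts"]:
                travellers[event["country"]].add(person)

    result = {}
    for person, contacts in knows.items():
        for country, people in travellers.items():
            if len(contacts & people) >= min_contacts:
                result.setdefault(person, set()).add(country)
    return result
```

In a graph database this would be a single traversal. In SQL it would take several joins, which is roughly the appeal.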
The next step was to ask everyone at York about their social media usage. Twitter accounts, blogs, followers, links. The survey tool was un-workable. I got nervous about asking for this data since I was already getting some people being a bit sniffy about using data from the web site. ALREADY HAD TOO MUCH DATA. VISUALISATIONS THAT LOOK PRETTY But don’t tell you anything.
The next step was to show that information somehow, in a way that people could interact with and explore it. Javascript libraries like THE JIT, software tools like Gephi, or draw it all yourself with something like PROCESSING.
And the intention was for that data to end up in a YORK SOCIAL NETWORK PROFILE – automatically generated… Cyn.in PROFILES are a bit lame. BUDDYPRESS on the way … STATUSNET perhaps…
This is the end result… A Network of CONCEPTS and PEOPLE… linked to CONCEPTS and PEOPLE. This was a triumph of simply GETTING SOMETHING DONE.
JISC had lots of sites – conference and met other bid winners, project blog, Google Docs, wikis – Frederique left. Going to be “found out” with lots of bits of code that didn’t work. COULDN’T WORK… Holy Grail.
Harvestman crawler – an old project… Delicious and Twitter changed their APIs – making them much less hacky. NEO4J had unfinished corners of the API – so I could either write it myself or wait a few months. Experimenting with other technologies and datasets. You don’t know until it’s too late.
The hope was that you grab lots of data then SIFT out meaningful information. HAD TOO MUCH STUFF. Get rid of the crap. PLURALS – TELEPHONE NUMBERS – OBVIOUS PLACES -
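A sketch of that sifting step, assuming the concepts arrive as a plain list of strings. The stop list, the phone-number pattern and the plural rule are all crude guesses made up for illustration ( the phone number in the usage note is fictional too ), not what the project actually used.

```python
import re

# Anything that is only digits, spaces and phone punctuation, 7+ characters long
PHONE = re.compile(r"^\+?[\d\s().-]{7,}$")

# Hypothetical stop list of places that turn up on nearly every page
OBVIOUS_PLACES = {"york", "leeds", "sheffield", "england"}


def singular(term):
    """Very naive plural folding: 'cabbages' -> 'cabbage'."""
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def sift(concepts):
    """Drop phone numbers and obvious places, fold plurals, remove duplicates."""
    kept = []
    seen = set()
    for raw in concepts:
        term = raw.strip().lower()
        if not term or PHONE.match(term) or term in OBVIOUS_PLACES:
            continue
        key = singular(term)
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept
```

So sift(["Cabbages", "cabbage", "01632 960123", "York", "Biology"]) leaves just cabbage and biology. The plural rule will mangle words like "physics", which is another argument for keeping the data EDITABLE.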
Tried to get the YCSSA team to help with Neo4J graph database. They helped me to understand more about graphs and networks and how there are some things even clever people ( or computers ) can’t understand. I started to try and create ONE BIG MATRIX – scary maths. DO THE SIMPLEST THING FIRST…. Even if it seems boring. Because then people will have something to look at and help you with.
NOT BAD IS GOOD ENOUGH … it’s whether a concept connects two things. EDITABLE is a must. THERE IS NO SILVER BULLET -
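"Whether a concept connects two things" is about the simplest possible version, with no matrix needed. A sketch, assuming each person already has a set of concepts; links_by_concept is a made-up name.

```python
from collections import defaultdict
from itertools import combinations


def links_by_concept(concepts_by_person):
    """Map each pair of people to the set of concepts that connects them."""
    # Invert the data: concept -> people tagged with it
    people_by_concept = defaultdict(set)
    for person, concepts in concepts_by_person.items():
        for concept in concepts:
            people_by_concept[concept].add(person)

    # Every pair of people sharing a concept gets a link labelled with it
    links = defaultdict(set)
    for concept, people in people_by_concept.items():
        for a, b in combinations(sorted(people), 2):
            links[(a, b)].add(concept)
    return dict(links)
```

No weighting, no ranking. A link either exists or it doesn't, and a person can delete the silly ones by hand.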
Rather than let people search then find nothing. Show people what’s available and let them choose. A type-ahead search box, only lets you search for what’s there. Linguistics Eye of the Beholder: People happily ignore the non-relevant stuff
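A type-ahead box that only offers what's there can be as simple as a prefix lookup over the known concepts. A minimal sketch; the TypeAhead class is hypothetical, and in the real thing this would sit behind a Javascript search box.

```python
from bisect import bisect_left


class TypeAhead:
    """Suggests only terms that actually exist in the data."""

    def __init__(self, terms):
        # Sort by lowercase form so every prefix maps to one contiguous run
        self._items = sorted((term.lower(), term) for term in set(terms))
        self._keys = [key for key, _ in self._items]

    def suggest(self, prefix, limit=10):
        prefix = prefix.lower()
        if not prefix:
            return []
        # Jump to the first key that could start with the prefix
        start = bisect_left(self._keys, prefix)
        matches = []
        for key, term in self._items[start:]:
            if not key.startswith(prefix) or len(matches) >= limit:
                break
            matches.append(term)
        return matches
```

Typing "bi" only ever offers things like Biochemistry or Biology if they are in the data, so nobody searches and finds nothing.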
Didn’t require anyone to enter data. Didn’t have to ask anyone. Cheap trick: Biggest squarest image. Maybe related ( via Google ). Like magic…
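One way to read "biggest squarest image": pick the image whose SHORTER side is longest, which is the same as asking which image holds the biggest square. That scoring rule is my assumption for this sketch, as are the function name and the example file names.

```python
def pick_image(images):
    """images: list of (url, width, height) tuples.

    Returns the url of the image with the longest shorter side,
    or None if there are no usable images.
    """
    candidates = [img for img in images if img[1] > 0 and img[2] > 0]
    if not candidates:
        return None
    # min(width, height) is the side of the biggest square that fits inside
    best = max(candidates, key=lambda img: min(img[1], img[2]))
    return best[0]
```

A wide banner ( 800 x 60 ) loses to a modest portrait photo ( 200 x 240 ), and tiny icons lose to both. That tends to find the picture of the person rather than the logo.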
Wanted to change the direction of the project mid-way… because stumbled upon a KILLER APP. A NEWS SITE: but every news article is linked to known data about University of York… and Leeds and Sheffield…
The animation in the visualisation added a dimension of time to information. Waiting. It actually saved a lot of coding BY ACCIDENT… by pulling well-connected concepts spatially nearer to each other… FUN. People VANITY SURF… then move wider. I have seen people using it, and use the words “I didn’t know they were doing that at Leeds”
Or “on something”. Waiting for our “ social host ” … would need better programmers, or more of a dev team workshop. Did JISC conference attendees’ blogs ( from their twitter accounts ) … a way of “meeting people before a conference” perhaps. The Lots of Big Ideas Proposal at The Hub. Bring the web pages onto the walls. Proposed this to Liz Waller for Harry Fairhurst. Could have done with some JISC help ( but also was scared by JISC advice ). I probably wouldn’t do it again…
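The "pulling nearer" effect is what a spring-style layout does on every animation frame. A stripped-down sketch of one frame, attraction only. Real force-directed layouts ( including the ones in the visualisation libraries ) also push unlinked nodes apart, and spring_step is a made-up name.

```python
def spring_step(positions, edges, pull=0.1):
    """Move each linked pair a fraction `pull` of their separation towards each other.

    positions: node -> (x, y)
    edges:     list of (node_a, node_b) links
    Returns a new positions dict; the input is left unchanged.
    """
    moved = {node: list(point) for node, point in positions.items()}
    for a, b in edges:
        # Deltas are measured on the old positions, so edge order doesn't matter
        ax, ay = positions[a]
        bx, by = positions[b]
        dx, dy = bx - ax, by - ay
        moved[a][0] += pull * dx
        moved[a][1] += pull * dy
        moved[b][0] -= pull * dx
        moved[b][1] -= pull * dy
    return {node: tuple(point) for node, point in moved.items()}
```

A node with many links gets tugged from many directions at once, so well-connected concepts drift together over repeated frames without any extra clustering code.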