Gors appropriate
- 4. So – the first thing to note is that I’m a technology optimist: I believe technology can
help make our lives simpler, even if at first it may look as if we are making it more
complex by introducing yet more tools to learn – and install on computers that our IT
department would rather we left under their control.
Taking control of your computing destiny is another theme of this talk…
In this example, the box diagram I showed on the first line was /written/ rather than
drawn. If I want to add steps, or have sub-branches added to the diagram, I don’t
need to start faffing around in Powerpoint or Word figures trying to line things up
and get them sized right and so on.
I let the machine do it.
In this particular online tool (you can see the URL in the screenshot at the top of the
slide – I’ll pop a copy of the annotated slides online, and also let Alan have a copy) –
so, in this particular tool, blockdiag, there are other diagram types available.
The underlying code is also open source and available as a Python package, so you can
write diagrams such as these in a Jupyter notebook, for example.
I’ll have more to say about Jupyter notebooks later.
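To give a flavour of what "written rather than drawn" means, here is a minimal sketch that builds the text of a blockdiag diagram from a list of steps. It only generates the diagram source (the `blockdiag { A -> B; }` form); rendering is left to the blockdiag tool itself. The function name `blockdiag_source` and the step names are illustrative assumptions, not taken from the slide.

```python
def blockdiag_source(steps, branches=None):
    """Return blockdiag source text for a chain of steps.

    `branches` optionally maps a step name to extra nodes hanging off it.
    Names are assumed to be single words, so no quoting is attempted.
    """
    lines = ["blockdiag {"]
    if steps:
        # One edge chain for the main flow: A -> B -> C;
        lines.append("  " + " -> ".join(steps) + ";")
    for parent, children in (branches or {}).items():
        for child in children:
            # One extra edge per sub-branch.
            lines.append(f"  {parent} -> {child};")
    lines.append("}")
    return "\n".join(lines)


# Illustrative step names only.
print(blockdiag_source(["Acquire", "Clean", "Present"], {"Acquire": ["Scrape"]}))
```

Adding a step or a sub-branch is then a one-word change to the list, and the layout is the machine's problem.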
- 6. So, the pipeline.
The first step, acquisition, relates to how we get hold of data. This may be from
downloaded data files – Excel spreadsheet documents (which are actually zip files –
you know you can change the xlsx suffix to zip and unzip them, right? Same with docx
Word document files and pptx Powerpoint files), databases, online APIs (application
programming interfaces), but it may be scraped from other sorts of document. Web
pages, for example, or PDF documents (even though PDF documents are horrible, it’s
often quite easy to extract data tables from them).
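The "xlsx is really a zip file" point can be checked without renaming anything, since Python's standard `zipfile` module will open such a file directly. The sketch below is an assumption-laden illustration: the helper name `list_package_parts` is made up, and the demo builds a small stand-in archive in memory rather than reading a real spreadsheet, so the part names shown are only examples of the kind of thing you would find.

```python
import io
import zipfile


def list_package_parts(source):
    """Return the names of the files packed inside a zip-based document.

    `source` can be a path such as a .xlsx, .docx or .pptx file,
    or any file-like object holding zip data.
    """
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


# Stand-in archive built in memory so the example runs anywhere.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")
    archive.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")

print(list_package_parts(buffer))
# ['xl/workbook.xml', 'xl/worksheets/sheet1.xml']
```

Pointing the same function at the path of a real spreadsheet should list the XML parts that make up the workbook.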
I’m not going to talk about the mechanics of scraping, but journalism lecturer Paul
Bradshaw has a good intro to a variety of tools and techniques in his Leanpub book
“Scraping for Journalists”.
- 10. Another tool I use from time to time is Apache Tika – this can extract text from PDFs,
Word documents and so on, as well as from images.
There are quite a few online OCR services now, many of them appearing as part of
“AI toolsets”, offering a range of commodity AI API services – IBM, Microsoft and
Google all have them, for example.
So as well as OCR text extraction, they do face and emotion detection in images,
semantic tagging / entity labeling within documents, automatic image tagging,
speech to text, and so on. All with varying degrees of success. But all of them steadily
improving.
- 11. After data acquisition, we’re often faced with cleaning a dataset. A tool I use for
cleaning data is another Java application, again accessed via a browser, called
OpenRefine.
OpenRefine will open a wide range of document types – spreadsheets, CSV or tab-delimited
data files, XML, JSON, HTML – either locally or from the web, and presents the data in a
spreadsheet-style UI.
A wide range of options are provided for applying a particular transformation to each
cell in a particular column – you can also script your own in a custom scripting
language, or Python – as well as tools for faceting and filtering the display of rows
based on values within one or more columns.
The clustering tools are useful for finding and correcting partial matches – so for
example, you can normalise MyCo Ltd, MyCo Ltd. and MyCo Limited to a single form, and so
on.
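As a rough idea of how this sort of clustering can work, here is a minimal sketch loosely modelled on the key-collision approach: each name is reduced to a normalised key, and names sharing a key are grouped as candidates for merging. It is not OpenRefine's implementation. In particular, the `ALIASES` table that treats "limited" as "ltd" is my own assumption, added so the MyCo example groups together.

```python
import re
from collections import defaultdict

# Hypothetical alias table: spellings that should count as the same token.
ALIASES = {"limited": "ltd"}


def fingerprint(name):
    """Reduce a name to a normalised key: lowercase, no punctuation,
    aliases applied, tokens de-duplicated and sorted."""
    cleaned = re.sub(r"[^\w\s]", "", name.strip().lower())
    tokens = {ALIASES.get(token, token) for token in cleaned.split()}
    return " ".join(sorted(tokens))


def cluster(names):
    """Group names whose fingerprints collide; singletons are dropped."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return [members for members in groups.values() if len(members) > 1]


print(cluster(["MyCo Ltd", "MyCo Ltd.", "MyCo Limited", "OtherCo plc"]))
# [['MyCo Ltd', 'MyCo Ltd.', 'MyCo Limited']]
```

A person still decides which spelling each cluster should be normalised to; the code only proposes the groups.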
- 34. At the moment, we’re rewriting a day-long residential school activity that
uses Lego robots. Until this year, we’ve used the original yellow Lego Mindstorms
RCX brick. This year, we’re using the Lego EV3 brick, which has wifi and can be set up
to run Linux and a Python shell that can access the robot’s bits.
The approach I’ve been exploring is to run a remote IPython kernel on the brick and
a Jupyter server on a desktop machine, and then connect a notebook to the remote
kernel via the Jupyter server.
Running the notebook server on the desktop machine removes the load of running the server
from the brick. (The same approach can be – and is – used to run large tasks on
supercomputer clusters.)
The notebooks also allow us to create simple interactive UIs – just like R has the shiny
framework, the Jupyter notebooks can run interactive ipywidgets directly wired to
Python state. In the example above, I have a slider for controlling motor speed
(actually, the duty cycle of the stepper motor) and another that displays the
value being seen by a particular sensor. (Again, there’s a tiny element of simplistic
data2text contextualisation in the display.)
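The logic behind widgets like these can be very small. The sketch below is not the code from the slide: the function names, the -100 to 100 duty-cycle range and the distance thresholds are all assumptions chosen for illustration. It shows a slider value being limited to a safe range, and a sensor reading being turned into a short phrase as a crude data2text step.

```python
def clamp_duty_cycle(value, low=-100, high=100):
    """Limit a requested duty cycle to an assumed safe range."""
    return max(low, min(high, value))


def describe_reading(value, near=20, far=80):
    """Wrap a raw sensor value in a short phrase.

    The thresholds are hypothetical and would depend on the sensor.
    """
    if value < near:
        label = "something is close"
    elif value > far:
        label = "nothing nearby"
    else:
        label = "something at middle distance"
    return f"Sensor reads {value}: {label}"


print(clamp_duty_cycle(150))
print(describe_reading(12))
```

In a notebook, functions like these could be handed to ipywidgets' `interact`, so that a slider supplies `value` and the returned text is displayed as it changes.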
- 36. And finally, a last bit of blatant self-promotion.
In the same way that maths has recreational maths – fun puzzles in the Sunday
papers – I engage in recreational data activities. And as with the blog, I keep a record
of what I’ve done.
Several years ago, I started to learn R, and used Formula One results and timing
sheets data as context for that. Over the years, I’ve pulled various tricks and
techniques together into this evolving book. (Actually, the book was also another
experiment – Leanpub encourages you to publish as you write, and uses markdown
for the manuscript. I was looking for an opportunity to explore whether we might be
able to use something like RStudio (and in particular Rmd, R Markdown) for authoring
OU course materials, so this gave me a reason – and a context – for exploring such a
workflow.)
It’s still a work in progress, but at over 400 pages already it represents a reasonably
deep dive into the different things you can do with a limited range of datasets on a
particular topic, as well as exploring a variety of ways of using – and appropriating – R
to help us find stories in data.