School of Data - mapping company networks

Some slide prompts to support a data framing investigation around corporate data, originally prepared for the OGP Festival, London, October 2013.

For more information, contact: schoolOfData.org

1

These notes provide a worked example of how to download company ownership relationship data from OpenCorporates (opencorporates.com) using the cross-platform data cleaning tool OpenRefine (openrefine.org), and then visualise the data using the cross-platform Gephi network visualisation tool (gephi.org).

2

OpenCorporates is a private company that has set itself the ambitious task of building a database of registered company information for every legal corporate entity in the world.

One of the views OpenCorporates offers over at least some of the data in its database shows how companies are connected by beneficial ownership or shareholder relationships.

Although complex, this diagram is “human readable”: the data is presented in a way that is intended to make some sort of meaningful sense to us.

3

But as well as publishing data for us humans to read, OpenCorporates also makes data available in a way that machines can read: machine readable data.

You may have heard of the term “API” in the context of data publishing websites. To all intents and purposes, an API is an interface that computers can use to get information out of websites in a way that they, and the databases they work with, can understand.

The data is published in a format known as JSON (JavaScript Object Notation). But you don’t really need to know much more than that: just that it’s called JSON, and that tools that can parse and work with JSON can parse and work with the data that the OpenCorporates API publishes.

4

If you aren’t a programmer, here’s a way of getting the data out of OpenCorporates and into a tabular form you may be more comfortable with, and which we can use to generate a network diagram to display in a tool such as Gephi…

You can download the OpenRefine application from openrefine.org. When you run it on your computer, it will launch an application that runs inside a browser tab using your default web browser.

5

We can get company ownership data (subsidiary relations, major shareholdings, etc.) from OpenCorporates by hacking the web address/URL of a company page on OpenCorporates.

From a company page on OpenCorporates, which should have the form:

http://opencorporates.com/companies/JURISDICTION/COMPANY_ID

add the following to the end of the web address/URL:

/network.json?depth=2

to give something with the following form:

http://opencorporates.com/companies/JURISDICTION/COMPANY_ID/network.json?depth=2

(Note: company network data may not be available in all jurisdictions or for all companies.)
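
If you would rather build the address in code than by hand, the same URL hack can be sketched in a few lines of Python. The function name network_url is made up for this example, and the jurisdiction code and company number below are placeholders rather than a real company; the sketch only builds the address and does not fetch anything.

```python
def network_url(company_page_url: str, depth: int = 2) -> str:
    """Append the network.json suffix to an OpenCorporates company page URL."""
    # Drop any trailing slash so we don't end up with a double slash.
    base = company_page_url.rstrip("/")
    return f"{base}/network.json?depth={depth}"


# Placeholder jurisdiction ("gb") and company number, for illustration only.
example = network_url("http://opencorporates.com/companies/gb/00000000")
print(example)
```

The depth value is passed straight through to the query string, so the depth=2 used in these notes is just the default here.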

6

In OpenRefine, select the option to Create [a new] Project using the web address (or URL) of the JSON data page that reveals the data relating to the corporate ownership network of the company we are interested in on OpenCorporates.

Note that you can import data into OpenRefine from several web addresses all in one go, though the data returned from each URL should have the same format or structure.

Using multiple URLs results in a combined data set, which can be quite handy.

7

Being machine readable, the data makes more sense to OpenRefine than it probably does to us!

Select a block of data in the preview view that is typical of a set of data that you want to map into a single row in a “traditional” spreadsheet-like view.

Data blocks are typically contained within braces (curly brackets); these things: { }

Note that in some machine readable data, some data blocks may be contained within other data blocks…

Each of the items in a single data block can be mapped into a separate cell (that is, a separate column) in a single row of data.

So each data block is a row, and each item in the block is a column… OpenRefine will give you a preview of how the data will look if you click the right button!
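
To make the “each block is a row, each item is a column” idea concrete, here is a minimal Python sketch of the same mapping. The sample JSON and its field names (parent_name, child_name, relationship_type) are invented for illustration and are not taken from the real OpenCorporates response; the helper blocks_to_rows is likewise hypothetical, and is not how OpenRefine does it internally.

```python
import json

# Hypothetical sample: these field names and companies are made up for
# illustration and are not the real OpenCorporates structure.
sample = """
[
  {"parent_name": "Alpha Holdings", "child_name": "Beta Ltd", "relationship_type": "subsidiary"},
  {"parent_name": "Alpha Holdings", "child_name": "Gamma Ltd", "relationship_type": "shareholding"}
]
"""


def blocks_to_rows(blocks):
    """Turn a list of data blocks (dicts) into column names plus rows."""
    # Collect column names in first-seen order across all blocks.
    columns = []
    for block in blocks:
        for key in block:
            if key not in columns:
                columns.append(key)
    # One row per block; an item missing from a block becomes an empty cell.
    rows = [[block.get(col, "") for col in columns] for block in blocks]
    return columns, rows


columns, rows = blocks_to_rows(json.loads(sample))
print(columns)
for row in rows:
    print(row)
```

Blocks nested inside other blocks are not handled here; in OpenRefine that is what picking the right record node in the preview takes care of.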

8

You can preview the effect of making particular block selections using Update Preview.

To return to the block highlighter, use ‘Pick Record Nodes’.

When you are happy with your selection, you are ready to “Create Project”.

9

Once we’re happy with the data preview, we can import the data into a more familiar looking layout.

The arrows at the top of each column pop up menus that allow us to run a wide variety of operations on a column.

One of the operations lets us change the column name, so I’m going to rename the child company and parent company columns to what Gephi expects: Source and Target.

10

This is the format that Gephi wants to see when we import data from a simple two column, comma-separated values (CSV) text file.

One of the columns needs to be called Source, another needs to be called Target.

When constructing the network diagram, Gephi then knows to draw a line going from each Source element to the corresponding Target.
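
For anyone preparing such a file outside OpenRefine, the two column layout can be sketched with Python’s standard csv module. The company names are invented, the helper edges_to_csv is hypothetical, and which company goes in Source and which in Target is a choice made for this example only.

```python
import csv
import io

# Hypothetical (source, target) pairs; the company names are invented.
edges = [
    ("Beta Ltd", "Alpha Holdings"),
    ("Gamma Ltd", "Alpha Holdings"),
]


def edges_to_csv(edge_pairs):
    """Return CSV text with a Source,Target header and one row per edge."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edge_pairs)
    return buffer.getvalue()


csv_text = edges_to_csv(edges)
print(csv_text)
```

Writing csv_text to a file with a .csv extension gives a two column text file with the Source and Target headings described above.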

11

The OpenCorporates network data in tabulated form. The default column names are not necessarily as human readable as they could be!

In particular, we can identify the name of the parent company and the child company for each ownership relation. We also have access to the OpenCorporates IDs for all of those companies. The type of relationship between the companies is also described.

For the moment, we will treat them all equally.

(If you want to view just those company connections that relate to a particular type of relation, use the Facet or Text Filter tool applied to the appropriate column.)
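
As a rough Python equivalent of filtering on one relationship type, the sketch below keeps only the rows whose type column matches a chosen value. The column name relationship_type, the type labels and the company names are all assumptions made for the example, not the real OpenCorporates column names or values.

```python
# Hypothetical rows; column names, type labels and companies are invented.
rows = [
    {"Source": "Beta Ltd", "Target": "Alpha Holdings", "relationship_type": "subsidiary"},
    {"Source": "Gamma Ltd", "Target": "Alpha Holdings", "relationship_type": "shareholding"},
    {"Source": "Delta Ltd", "Target": "Alpha Holdings", "relationship_type": "subsidiary"},
]


def filter_by_type(table, wanted, column="relationship_type"):
    """Keep only the rows whose relationship column equals the wanted value."""
    return [row for row in table if row.get(column) == wanted]


subsidiaries = filter_by_type(rows, "subsidiary")
for row in subsidiaries:
    print(row["Source"], "->", row["Target"])
```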

12

From the appropriate column menu, select “Edit Column” and then “Rename this column” to change the column name.

13

We can now export the data using the Custom Tabular Exporter.

Deselect all the columns, then select just the Source and Target columns; we will only export data from these two columns.

14

Preview your data to check that it looks like the sort of data you expect to export.

From the Download tab, select the CSV output type and export your data. It should be saved into the default download directory used by your browser, with a file name that corresponds to the OpenRefine project name.

You should now have the two column data saved to your computer, ready to load into Gephi.

15

Gephi is a powerful cross-platform desktop tool for visualising data that describes networks, such as social networks or corporate ownership networks. You can import data into Gephi using specialised graph/network representation formats, or from simple two column data files where each row describes a simple connection between two elements (e.g. thing1, thing2 would say that thing1 connects to thing2).

You can download the Gephi application from gephi.org. When you run it on your computer, it will launch a desktop application. Note that Gephi requires Java; if you are on a Mac, you may need to download and install Java yourself: www.java.com

16

Launch Gephi (download it from gephi.org if you don’t already have it installed) and select Data Laboratory.

If the Data Table toolbar is empty, go to the application’s File menu and select ‘New Project’. A new project will be created and you should see several toolbar options appear in the Data Table.

17

Load the data in using the “Import Spreadsheet” tool option. Make sure that you select Edges table as the table type.

If your data file does not have Source and Target column names, an error will occur and you will not be able to import the data file. (In such a case, you could always open the file in a text editor, change the column names in the file, save it, and try again. Alternatively, go in to OpenRefine, change the column names there, and re-export the custom tabulated data…)

18

The final stage of the import gives some additional information about how uploaded data will be treated.

Because we are simply loading in data that describes how one company (identified by its name) is connected to another company (also identified by its name), we need to get Gephi to automatically create a node each time it sees a new company (as identified by its company name…).
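
The idea of creating a node the first time a name is seen can be sketched in Python as follows. This is only an illustration of the principle, not Gephi’s own import code; the function nodes_from_edges and the company names are invented for the example.

```python
def nodes_from_edges(edge_pairs):
    """List each distinct name once, in the order it is first seen."""
    nodes = []
    seen = set()
    for source, target in edge_pairs:
        for name in (source, target):
            if name not in seen:
                seen.add(name)
                nodes.append(name)
    return nodes


# Hypothetical edge list; the company names are invented.
edges = [("Beta Ltd", "Alpha Holdings"), ("Gamma Ltd", "Alpha Holdings")]
nodes = nodes_from_edges(edges)
print(nodes)
```

Because the name doubles as the identifier, two different companies that happen to share a name would end up as a single node, which is worth bearing in mind when reading the resulting network.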

19

When the data is imported, we can preview it, either by looking at a list of nodes that have been created, or at the ‘edges’, that is, the connections between two companies.

20

So now let’s see where we can start to view this data as a network visualisation.

Click on the top palette Overview button to get an overview of the network in visual form. This is the area where we can interactively visualise the network.

21

The default Overview layout has three main areas:

- in the middle is the canvas where we can see the current layout of the network; along the left hand side of the central panel are several tools for operating on the elements shown on the canvas; along the bottom of the central panel are several tools for controlling how text labels are displayed.

- to the left are several tools for manipulating what the network looks like: tools for laying out the network (that is, positioning the nodes) automatically, as well as colouring and sizing the nodes;

- to the right are several tools that allow us to analyse and process the graph (that is, the mathematical structure that defines the network); for example, we can run various statistics on the network, or filter the nodes that are displayed according to one or more specified criteria.

22

Let’s start by laying out the network. There are several layout tools provided by default (you can install more from the Tools->Plugins menu), each of which behaves slightly differently and can be more or less effective at laying out networks with different sorts of structure.

A couple of good all-round layout algorithms are:

- ForceAtlas2

- Yifan Hu.

If you imagine connected nodes held together by springs, you can think of these layout tools as trying to position the nodes so that the springs are stretched as little as possible. Sort of.
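
The spring picture can be turned into a toy Python sketch. To be clear, this is not ForceAtlas2 or Yifan Hu: it only models the springs, whereas real force-directed layouts also push nodes apart. The function spring_step and the rest_length and step parameters are all assumptions made for illustration.

```python
import math


def spring_step(positions, edges, rest_length=1.0, step=0.1):
    """Move the two ends of every edge a little towards the rest length."""
    new_positions = dict(positions)
    for a, b in edges:
        ax, ay = new_positions[a]
        bx, by = new_positions[b]
        dx, dy = bx - ax, by - ay
        dist = math.hypot(dx, dy)
        if dist == 0:
            continue  # coincident nodes: no direction to move along
        # Positive when the spring is stretched, negative when compressed.
        stretch = dist - rest_length
        shift_x = step * stretch * dx / dist
        shift_y = step * stretch * dy / dist
        new_positions[a] = (ax + shift_x, ay + shift_y)
        new_positions[b] = (bx - shift_x, by - shift_y)
    return new_positions


# Two connected nodes that start four units apart.
positions = {"A": (0.0, 0.0), "B": (4.0, 0.0)}
edges = [("A", "B")]
for _ in range(50):
    positions = spring_step(positions, edges)

distance = math.hypot(positions["B"][0] - positions["A"][0],
                      positions["B"][1] - positions["A"][1])
print(round(distance, 3))
```

Each pass shortens an over-stretched spring by a fixed fraction of its stretch, so the two nodes settle close to the rest length after a few dozen passes.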

23

At the moment, we don’t know what each node represents. By default, when labels are switched on, Gephi looks for a label column value associated with a node and displays that. But we can also display other values. In this case, we are using a company name as the node ID, so we can select id as the element to display when we switch labels on. Click on the clipboard icon on the toolbar at the bottom of the screen to raise the label selector.

To actually switch labels on, click on the leftmost/darkest T button on the toolbar at the bottom of the screen.

The slider on the right controls the text label size.

24

We can also change the size of labels in proportion to the size of a node, but how do we size nodes?

Whilst it is possible to load in data that describes various attributes associated with each node (for example, in the case of a company node it might be the turnover or profit in the last financial year), we can also generate information about each node based on various network properties.

For example, the degree of a node says how many connections it has with other nodes. Where connections are ‘directed’ (that is, represented by arrows), the number of arrows that leave a node is referred to as the out-degree of the node, and the number of arrows that come into a node as the in-degree.
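
These definitions can be checked by hand on a small edge list. The Python sketch below counts in-degree, out-degree and total degree for directed (source, target) pairs; the function degree_table and the company names are invented for the example, and the total degree is taken here to be the sum of the other two.

```python
from collections import Counter


def degree_table(edge_pairs):
    """Return {node: (in_degree, out_degree, degree)} for directed edges."""
    out_degree = Counter(source for source, _ in edge_pairs)
    in_degree = Counter(target for _, target in edge_pairs)
    nodes = set(out_degree) | set(in_degree)
    return {
        node: (in_degree[node], out_degree[node],
               in_degree[node] + out_degree[node])
        for node in nodes
    }


# Hypothetical directed edges; the company names are invented.
edges = [
    ("Beta Ltd", "Alpha Holdings"),
    ("Gamma Ltd", "Alpha Holdings"),
    ("Alpha Holdings", "Omega plc"),
]
table = degree_table(edges)
for node in sorted(table):
    print(node, table[node])
```

In this made-up network, Alpha Holdings has two arrows coming in and one going out, so it has the highest degree.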

25

We can use the Average Degree statistic tool to calculate the degree, in-degree and out-degree values for each node.

We can then use these values as the basis for sizing the nodes in the network visualisation.

26

Here we have sized the nodes by Degree. The min and max size parameters can be set as required to scale the size of the nodes.
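
One simple way to picture what min and max size do is a straight-line rescaling of the degree values, sketched below in Python. Whether Gephi interpolates in exactly this way is an assumption; the function scale_sizes, the default sizes of 10 and 50, and the degree values are all made up for illustration.

```python
def scale_sizes(values, min_size=10.0, max_size=50.0):
    """Linearly map each value onto the range [min_size, max_size]."""
    low, high = min(values.values()), max(values.values())
    if high == low:
        # All nodes share one value, so give them all the minimum size.
        return {node: min_size for node in values}
    span = high - low
    return {
        node: min_size + (value - low) / span * (max_size - min_size)
        for node, value in values.items()
    }


# Hypothetical degree values for four invented companies.
degrees = {"Alpha Holdings": 3, "Beta Ltd": 1, "Gamma Ltd": 1, "Omega plc": 2}
sizes = scale_sizes(degrees)
print(sizes)
```

The lowest-degree nodes get the minimum size, the highest-degree node gets the maximum, and everything else falls in between.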

27

We can set the label size so that it is proportional to the node size: from the black/dark A label on the toolbar at the bottom of the screen, select the [proportional to] Node Size menu option.

28

As well as tools for generating grand-scale layouts, there are also layout tools for tweaking a particular layout.

The Expansion tool just stretches (or shrinks) the layout in the x and y directions. This can be good for just putting a bit of space into a layout.

The Label Adjust tool juggles nodes so that their labels don’t overlap. Note that this tool may move some nodes quite a distance compared to their neighbours and so may upset any meaningful spatial relationships obtained using the other layout tools.

29

We can colour and size nodes according to a wide range of properties obtained from running various network statistics.

As you work with network data more and more, you start to get a feel for which tools to use to help you look for particular patterns, structures and stories within the data.

But that is a tutorial for another day…

30

We can use various tools in concert to tweak the layout of the network.

In this example, I have:

- sized the nodes by degree;

- set the label sizes proportional to the Degree;

- tweaked the scale using the text-size slider;

- used the Authority value (obtained via the HITS statistic) to colour the nodes;

- laid out the network using a ForceAtlas2 algorithm, a bit of Expansion and a dash of Label Adjust.

31

If you want to know more, contact us…

32
