Introduction talk at the University of Strathclyde (Scotland) Algorithms Workshop, providing a quick overview of the fundamental and practical reasons why algorithms are/are not technical black boxes. (This talk does not address trade secrets or other business reasons for lack of transparency.) The presentation was given to an audience of academics and students at the law department.
3. • Technical issues
– Fundamental
– Practical
• Business/management interests, e.g. trade secrets
How a decision was reached can in principle be revealed, if sufficient data about the state of the system at the time of operation is available (and we have access to the code).
Why a particular chain of operations was performed is much more difficult to reveal (especially with ML).
Origins of the ‘black box’
4. • Machine Learning (ML)
• Hand coded
Fundamental properties
O1 = f(w1, H1, w2, H2, w3, H3)
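A minimal sketch of the formula above, assuming f is a weighted sum passed through a sigmoid (the slide does not specify f; the weights w and hidden values H are invented for illustration):

```python
import math

def f(w, H):
    # Toy output node: weighted sum of hidden values, squashed by a sigmoid.
    # The sigmoid is an assumption; the slide leaves f unspecified.
    return 1.0 / (1.0 + math.exp(-sum(wi * hi for wi, hi in zip(w, H))))

w = [0.4, -1.2, 0.7]   # parameters (set by training or by hand)
H = [1.0, 0.5, 2.0]    # hidden values derived from the input data
O1 = f(w, H)           # with w and H known, *how* O1 is computed is fully traceable
print(O1)
```

With the parameters and data in hand, every arithmetic step from input to O1 can be replayed; this is the 'how' that the next slide contrasts with 'why'.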
5. Machine Learning
• If parameters & data are known, we can trace how the output is computed.
• If the history of data is known, we can in principle trace how parameters were set.
• Explaining why certain parameters are optimal can be very difficult.
=> Explaining why output is produced is difficult.
Hand Coded
• If parameters & data are known, we can trace how the output is computed.
• We know the parameters were set by engineers.
• We can ask the engineers why certain parameters were chosen.
=> Explaining why output is produced depends on the engineers.
Fundamental transparency: how vs. why
6. • High dimensionality of Big Data algorithms can make interpretation of the 'explanation' problematic
– e.g. Google's page ranking algorithm is estimated to involve 200+ parameters
• Approximated transparency through dimensionality reduction, e.g. Principal Component Analysis (PCA) (see the sketch below)
– requires case-by-case analysis depending on input data
– a 'general' solution is only valid for 'majority case' conditions
High dimensionality, a.k.a. when an explanation is not transparent
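A minimal sketch of approximated transparency via PCA, using scikit-learn. The 200-dimensional data is synthetic and the choice of two retained components is an assumption for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))   # synthetic stand-in for 200+ parameter data
X[:, 0] += 3 * X[:, 1]             # inject some correlated structure

pca = PCA(n_components=2)          # reduce 200 dimensions to 2 for interpretation
Z = pca.fit_transform(X)

# Fraction of the original variance the 2-D 'explanation' preserves:
# a reminder that the reduced view is only valid for the 'majority case'.
print(pca.explained_variance_ratio_.sum())
```

The explained-variance ratio makes the approximation explicit: whatever the two components fail to capture is exactly the behaviour the simplified explanation cannot account for.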
7. Machine Learning
• If Machine Learning algorithms use 'in situ' continuous or intermittent learning, the parameter settings change over time.
• To re-create a system behaviour requires knowledge of the past parameter states.
Hand Coded
• Hand coded systems are also frequently updated, especially if there is an 'arms race' between the service provider and users trying to 'game' the system (e.g. Google search vs. Search Engine Optimization).
Practical issues: non-static algorithms
In some cases, randomness might be built into an algorithm's design, meaning its outcomes can never be perfectly predicted.
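A minimal sketch of both practical issues, using an invented online update rule and Python's built-in random module: the parameter drifts with every observation, so re-creating an old output requires the parameter state logged at that time, and without a fixed seed the data stream itself is unrepeatable:

```python
import random

random.seed(42)      # without a fixed seed, built-in randomness makes runs unrepeatable

w = 0.5              # a single parameter, learned 'in situ'
history = []         # log of past parameter states
for t in range(5):
    x = random.uniform(-1, 1)     # incoming data point
    y_pred = w * x                # system output at time t
    w += 0.1 * (x - y_pred) * x   # toy online update: w drifts with every observation
    history.append((t, w))

# Re-creating the output at t=2 requires the parameter state logged at t=2,
# not the current value of w.
print(history)
```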
8. • Defining precisely what a task/problem is (logic)
• Breaking that down into a precise set of instructions, factoring in any contingencies, such as how the algorithm should perform under different conditions (control).
• "Explain it to something as stonily stupid as a computer" (Fuller 2008).
• Many tasks and problems are extremely difficult or impossible to translate into algorithms and end up being hugely oversimplified (see the toy example below).
• Mistranslating the problem and/or solution will lead to erroneous outcomes and random uncertainties.
The challenge of translating a task/problem into an algorithm
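A toy illustration of the translation problem: the fuzzy task 'flag suspicious logins' reduced to precise logic and control. The features and threshold are invented, and the oversimplification is the point; any real notion of 'suspicious' is far richer than this rule:

```python
def is_suspicious(login_hour: int, failed_attempts: int) -> bool:
    # Logic: what counts as 'suspicious'? Here, crudely, a night-time login
    # after repeated failures. Control: a single rule, with no contingencies
    # for travel, shared accounts, time zones, etc.
    return (login_hour < 6 or login_hour > 22) and failed_attempts >= 3

print(is_suspicious(3, 4))    # True:  flagged, as intended
print(is_suspicious(14, 10))  # False: a daytime brute-force attempt goes unflagged
```

The rule executes exactly as specified, yet the mistranslation (equating 'suspicious' with 'night-time plus failures') produces precisely the erroneous outcomes the bullet above warns about.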
9. System design in the real world
https://effectivesoftwaredesign.com/2012/04/23/communication-problems-in-software-projects/
10. • Algorithms are created through trial and error, play, collaboration, discussion, and negotiation.
• They are teased into being: edited, revised, deleted and restarted, shared with others, passing through multiple iterations stretched out over time and space.
• They are always somewhat uncertain, provisional and messy fragile accomplishments.
• Algorithmic systems are not standalone little boxes, but massive, networked ones with hundreds of hands reaching into them, tweaking and tuning, swapping out parts and experimenting with new arrangements.
Algorithm creation
Gillespie, T. (2014a) 'The relevance of algorithms', in Gillespie, T., Boczkowski, P.J. and Foot, K.A. (eds) Media Technologies: Essays on Communication, Materiality, and Society. Cambridge, MA: MIT Press, pp. 167-93; Neyland, D. (2014) 'On organizing algorithms', Theory, Culture and Society, online first. Cited in: Kitchin, R. and Dodge, M. (2017) 'The (in)security of Smart Cities: Vulnerabilities, Risks, Mitigation and Prevention', SocArXiv, February 13. osf.io/preprints/socarxiv/f6z63
11. • Deconstructing and tracing how an algorithm is constructed in code and mutates over time is not straightforward.
• Code often takes the form of a "Big Ball of Mud": "[a] haphazardly structured, sprawling, sloppy, duct-tape and bailing wire, spaghetti code jungle".
Examining pseudo-code/source code
Foote, B. and Yoder, J. (1997) 'Big Ball of Mud', Pattern Languages of Program Design 4: 654-92. Cited in: Kitchin, R. and Dodge, M. (2017) 'The (in)security of Smart Cities: Vulnerabilities, Risks, Mitigation and Prevention', SocArXiv, February 13. osf.io/preprints/socarxiv/f6z63
12. • Reverse engineering is the process of articulating the specifications of a system through a rigorous examination, drawing on domain knowledge, observation, and deduction, to unearth a model of how that system works.
• By examining what data is fed into an algorithm and what output is produced, it is possible to start to reverse engineer how the recipe of the algorithm is composed (how it weights and preferences some criteria) and what it does (see the sketch below).
Reverse engineering
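A minimal sketch of reverse engineering by input/output probing: treat the system as a black box, feed it varied inputs, and fit a surrogate model to the observed outputs. The hidden weights and the linear form are assumptions made purely for illustration:

```python
import numpy as np

def black_box(x):
    # Stand-in for an opaque system; the true weights are hidden from the analyst.
    hidden_w = np.array([2.0, -0.5, 0.0, 1.5])
    return x @ hidden_w

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))   # probe inputs fed to the system
y = black_box(X)                # observed outputs

# Fit a linear surrogate by least squares: the recovered coefficients
# estimate how the black box weights and preferences each criterion.
w_est, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_est)                    # approximately [2.0, -0.5, 0.0, 1.5]
```

Real systems are rarely linear, so in practice the surrogate only approximates the recipe, but the workflow (probe, observe, model) is the same.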
13. • HOW
– With access to the code, data and parameter settings, HOW the output was produced can be 'explained'.
– High dimensionality can make the 'explanation' difficult to understand.
– Dimensionality reduction can help to generate an approximate explanation that is understandable.
• WHY
– Can be (very) difficult to determine, especially if Machine Learning methods are used.
– Approximate explanation based on the manually set optimization targets can help.
Conclusion
Researchers might search Google using the same terms on multiple computers in multiple jurisdictions to get a sense of how its PageRank algorithm is constructed and works in practice (Mahnke and Uprichard 2014). They might experiment with posting and interacting with posts on Facebook to try to determine how its EdgeRank algorithm positions and prioritises posts in user timelines (Bucher 2012). Or they might use proxy servers and feed dummy user profiles into e-commerce systems to see how prices vary across users and locales (Wall Street Journal, detailed in Diakopoulos 2013).