This resource is part of Kris Coppieters' Tech Forum presentation: https://youtu.be/3OW96zE6N2I
Do you find yourself repeating the same task over and over? Or feeling certain there is a way to automate a task but it's just outside of your skill set? Kris Coppieters from Rorohiko has built a career solving just those kinds of problems. Whether it's scripting a solution from within InDesign, or using AppleScript to finish off some markup, Kris can show you how to bring high-level thinking to quick and dirty tasks.
techforum.booknetcanada.ca
#TechForum #ebookcraft
Better problem solving through scripting: How to think through your #eprdctn roadblocks - Course notes
Better problem solving through scripting: How to think through your
#eprdctn roadblocks and script your way to efficiency
Two categories of automation
When it comes to automation, we can classify most of the software used into two large
categories which I'll call tools and systems.
Tools
Tools are smaller programs which are similar to tools used in various crafts, e.g. carpentry.
In carpentry, you might use a hammer, nails, screws, screwdrivers, saws, CNC machines,
drills... A few of these tools are more complex and have a wide range of functions, but
many tools are simple and have a very specific function. Some are made for one particular
task only.
In #eprdctn, the tools used are text editors, search-and-replace, XML editors like Oxygen,
testing tools, reader programs...
Systems
Systems are more complex than simple tools. They take in some raw materials and spit out
finished results at the other end.
A sawmill can be seen as a system. It processes logs and produces planks, poles, boards...
The people working in the sawmill will use machines and tools to make the system work.
A working system can encompass many processes and sub-processes. Some processes are
automated, some are manual processes.
In #eprdctn, we might have a system that takes in raw data and produces ebooks. In
#eprdctn, and in publishing in general, systems are often called workflows.
Let's talk about tools and jigs
This presentation is all about tools for your craft.
There's a guy called Dan Erlewine on YouTube, who is a luthier. He has many YouTube
movies on guitar repair, and many of those movies are about clever self-made tools he uses
to work on guitars. He calls them 'jigs'.
https://www.youtube.com/results?search_query=dan+erlewine+jig
I like that term, so I want to coin the term 'jig' for #eprdctn custom tools. You've heard it
here first!
The first step is to take notice when you find yourself repeatedly performing a cumbersome,
difficult, or repetitive task. Then ask yourself whether it's possible to create a jig for that.
Basic Tools
When working in #eprdctn, a lot of work involves editing various text files. More often than
not, the editing will aim to affect the structure of the document, rather than the content:
retagging, restructuring tags, managing CSS classes,...
Another common operation will be unpacking and repacking files. For example, EPUB files
are nothing more than glorified .ZIP files.
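Because an EPUB is a ZIP archive, any ZIP-aware tool can look inside it. As a minimal sketch, Python's standard zipfile module can list the files in an EPUB. The function name list_epub_entries is made up for this illustration.

```python
import sys
import zipfile


def list_epub_entries(epub_path):
    """Return the names of all files stored inside an EPUB (a ZIP archive)."""
    with zipfile.ZipFile(epub_path) as epub:
        return epub.namelist()


# Usage from the command line: python listEpub.py some.epub
# (listEpub.py is a hypothetical file name for this sketch)
if __name__ == "__main__" and len(sys.argv) > 1:
    for name in list_epub_entries(sys.argv[1]):
        print(name)
```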
Regular Expressions
One of the tools you want in your toolchest is some familiarity with regular expressions. Once
you master regular expressions, you can use search-and-replace for a fairly wide range of
tasks.
Regular expressions can help with tasks that go beyond a simple search-and-replace; for
example, you could use a search-and-replace with regular expressions for restructuring
HTML.
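As an illustration of restructuring HTML with a regular expression, the sketch below retags a paragraph as a heading. It uses Python's re module. The class name chapter-title and the sample markup are invented for the example.

```python
import re

# Hypothetical input: a paragraph that is really a chapter heading.
html = '<p class="chapter-title">Chapter 1</p>\n<p>Some text.</p>'

# Capture the text inside the paragraph, then rebuild the element as <h1>.
pattern = r'<p class="chapter-title">(.*?)</p>'
restructured = re.sub(pattern, r'<h1>\1</h1>', html)

print(restructured)
```

The same pattern and replacement can usually be pasted into a text editor's find-and-replace dialog, after adjusting for that editor's dialect.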
Regular expressions are not easy to master. They are very cryptic, almost impossible to
read, and they are not standardized.
Regular expressions are also often loosely referred to as 'GREP' which is a reference to the
Unix command line tool which started it all. GREP = Global Regular Expression Print.
Lack of standardization
You might be using a text editor like BBEdit, Notepad++, Sublime Text,... All of these support
regular expressions, but each of them will have its own unique 'dialect'. They'll be 95% the same
between the different text editors, but there are subtle differences.
You might be using InDesign as the source for EPUB documents. InDesign supports regular
expressions in its Find-Change dialog. And these are in a specific InDesign 'dialect' which has
only 80% similarity when compared to the regular expressions used in common text editors.
InDesign has some fairly unique features with regards to regular expressions, features which
you won't find in your text editor.
It does not end there. InDesign supports a scripting language called ExtendScript which is a
form of JavaScript. ExtendScript supports regular expressions, and guess what? They use yet
another dialect of regular expressions, again quite different to the InDesign regular
expressions as seen in the Find-Change dialog.
Then we have the various scripting languages that could be used for tool creation - PHP,
Perl, Python, JavaScript, awk, sed... There are dozens of them. None of them is 'better' -
whatever works, works.
Again, all of them have regular expressions, and each scripting language will have its own
unique dialect.
Understand the basic principle and use the documentation
To make sense of it all, my recommendation is that you must understand the basic ideas of
regular expressions and how they are constructed. These are well supported in all the
dialects.
Once you understand the basic ideas, you need to consult the documentation and/or use
the facilities of the software at hand to determine the proper expressions.
To match a thin space, for example:
InDesign: ~<
Most text editors: \x{2009}
Referring to matched parenthesized sub-expressions in replacement patterns is another
point of difference. For example, sometimes you need to use $1, and sometimes you need to
use \1 to refer to the first parenthesized subexpression.
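To show what such dialect differences look like in one concrete case, here is a small sketch for Python's re module. In this dialect a thin space can be written as \u2009 in the pattern, and the replacement refers to groups with a backslash and the group number, or with \g<1>. The $1 notation has no special meaning here. The sample text is made up.

```python
import re

# Sample text with thin spaces (U+2009) between numbers and units.
text = "10\u2009kg and 5\u2009m"

# Python dialect: \u2009 in the pattern, \1 and \2 in the replacement.
with_backslash = re.sub(r'(\d+)\u2009(\w+)', r'\1 \2', text)

# Equivalent, more explicit notation for the same groups.
with_g_notation = re.sub(r'(\d+)\u2009(\w+)', r'\g<1> \g<2>', text)

# $1 is NOT a group reference in Python; it is inserted literally.
with_dollar = re.sub(r'(\d+)\u2009(\w+)', '$1 $2', text)

print(with_backslash)   # 10 kg and 5 m
print(with_g_notation)  # 10 kg and 5 m
print(with_dollar)      # $1 $2 and $1 $2
```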
Text Editors
Another basic tool you need is a (set of) good text editors. Steer clear of word processors or
underpowered tools: don't use MS Word, Apple's TextEdit, Notepad.exe ... None of these
are proper text editors.
Word processors often try to be 'helpful': they change straight quotes into curly quotes, or
muck with line endings and character encodings, blithely destroying your HTML structure.
Some editors are much more than that. If you can afford it, you want to have Oxygen XML in
your tool chest. This tool can serve as a text editor but it also understands XML, HTML, CSS...
and will allow you to do much smarter editing with much reduced risk.
Editing XML files in a regular text editor works just fine, but you run the risk of damaging
some finely tuned tag balance, and never know it.
Some text editors (e.g. BBEdit, Oxygen, Atom...) are smart enough to handle text files inside
ZIP-ped data files (e.g. EPUB files) and you can edit text files inside an EPUB without needing
to 'crack it open'.
Scripting Languages
Another powerful basic tool is a working understanding of one or more scripting
languages. There are many out there: the most popular ones are probably Python,
JavaScript, PHP, Perl.
Most of these scripting languages offer the necessary support for complex operations, e.g.
zipping and unzipping, regular expressions, search-and-replace, XML parsing, accessing data
over HTTP or HTTPS connections, connecting to databases...
A scripting language comes in handy when you're faced with a repetitive task that goes
beyond what you can accomplish with find-and-replace and regular expressions. For
example, when there is some 'if-then' logic that needs to be handled, or some processing that
needs to be done.
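As a sketch of the kind of 'if-then' logic that is awkward in a plain find-and-replace, the Python snippet below adds an empty alt attribute to img tags, but only if they do not already have one. The task and the function name add_missing_alt are invented for illustration, and the simple pattern assumes no '>' character occurs inside attribute values.

```python
import re


def add_missing_alt(match):
    """Return the matched <img> tag, adding alt="" only if alt is absent."""
    tag = match.group(0)
    if re.search(r'\salt\s*=', tag):
        # The tag already has an alt attribute: leave it untouched.
        return tag
    if tag.endswith('/>'):
        # Self-closing XHTML form: insert before the closing "/>".
        return tag[:-2].rstrip() + ' alt=""/>'
    # Plain HTML form: insert before the closing ">".
    return tag[:-1] + ' alt="">'


xhtml = '<img src="a.png"/><img src="b.png" alt="B"/>'

# re.sub calls add_missing_alt once for every <img ...> tag it finds.
fixed = re.sub(r'<img\b[^>]*>', add_missing_alt, xhtml)
print(fixed)
```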
Tools
EPUB unpack/repack: eCanCrusher
When working with EPUB, if you have access to tools like BBEdit or Oxygen, you can perform
EPUB-wide operations straight on the EPUB file without ever having to
decompress/recompress it.
But sometimes you want to decompress the EPUB, make some changes, then re-compress
it.
There are multiple tools that do this. eCanCrusher is one of them. It works in simple
drag/drop fashion.
https://www.docdataflow.com/ecancrusher/
To decompress: drag/drop an EPUB file onto the eCanCrusher application icon. A
decompressed EPUB folder will appear.
To recompress: drag/drop an EPUB folder onto the eCanCrusher application icon. A
compressed EPUB file will appear.
To configure: double-click the eCanCrusher application icon.
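For readers who prefer a script over drag/drop, the same decompress/recompress round trip can be sketched with Python's standard zipfile module. This is not how eCanCrusher itself works internally. The function names are made up for the example. The one EPUB-specific detail is that the mimetype file must be the first entry in the archive and must be stored uncompressed.

```python
import os
import zipfile


def unpack_epub(epub_path, folder):
    """Extract all files from an EPUB into a folder."""
    with zipfile.ZipFile(epub_path) as epub:
        epub.extractall(folder)


def repack_epub(folder, epub_path):
    """Zip a decompressed EPUB folder back into an EPUB file.

    The output file should live outside the folder being zipped.
    """
    with zipfile.ZipFile(epub_path, "w") as epub:
        # The mimetype entry goes first and is stored without compression.
        epub.write(os.path.join(folder, "mimetype"), "mimetype",
                   compress_type=zipfile.ZIP_STORED)
        for root, _dirs, files in os.walk(folder):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Archive names always use forward slashes.
                arc_name = os.path.relpath(full_path, folder).replace(os.sep, "/")
                if arc_name != "mimetype":
                    epub.write(full_path, arc_name,
                               compress_type=zipfile.ZIP_DEFLATED)
```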
Custom Scripts
Another set of tools on your toolbelt can be custom scripts written in a variety of scripting
languages.
Languages like Python, PHP, JavaScript/Node.js... can be used to write scripts that process
individual text files (e.g. XHTML, CSS,...) or complete EPUBs.
None of these is particularly better or worse, and switching to a different language than the
one you already know is rarely beneficial.
All of these scripting languages have features to facilitate handling of XML, pattern
matching, and so on.
There are two hurdles to writing scripts: first of all, installing and configuring the software to
use the scripting language is not always straightforward.
Second, writing scripts is not for the faint of heart, but the rewards are tremendous.
Pick one language, get good at it.
Macintosh
On a standard Mac, some common scripting languages are pre-installed (e.g. PHP, Python
2.7). Installing additional languages is straightforward: open the Terminal window
(Applications -> Utilities -> Terminal.app) then invoke the scripting language. For common
scripting languages the Mac will propose to download and install the necessary command
line tools. In the screenshot below, I've just typed python3.
As python3 is not installed by default, the Mac offers to fetch and install it.
For node.js (JavaScript) you need to visit and download from https://nodejs.org
Windows
Windows does not have the more common scripting languages pre-installed. There are
many options to download and install these.
One of the many options is Cygwin, which installs a 'Unix-like' environment on Windows and
allows you to use the same command-line tools as Mac and Linux users:
https://cygwin.com
When you run the Cygwin installer, you'll see a window where you can pick-and-choose and
decide which Linux/Unix tools to install.
I find it easiest to install all PHP-related and all Python-related stuff, and things like zip and
unzip.
Rather than try and be selective, I simply search for 'PHP' and/or 'Python' in the package list
and select the whole 'PHP' and 'Python' collection.
To install Node.js you need to download and run an installer from https://nodejs.org
Creating a script
There are many ways to go about this, and I won't even attempt to list all of them.
Instead, I'll be creating a very simple script from scratch and will run it on some XHTML files.
For the sake of argument, my task is to go through an EPUB file and find the style attribute
associated with the <body> tag, and remove it. Instead, I'll move that style attribute into the
CSS file for the body tag.
The first step is to experiment a bit. I can decompress the EPUB using eCanCrusher, or I
can use an EPUB-aware text editor like BBEdit, Atom, Oxygen XML.
I'll open one of the xhtml files and set out to find the regular expression pattern that works.
Eventually, I came up with:
(<body[^>]*) style="[^"]*"([^>]*>)
to be replaced by
\1\2
or
$1$2
depending on the dialect of GREP your text editor is using.
If we only need to do one EPUB, we could simply do a search-and-replace all.
But we want to do this to many EPUBs.
The next step is to create a script that will take a file name as a command line parameter,
which then reads the file, performs the search-and-replace, and overwrites the file with the
updated file.
I created a file deleteBodyStyle.php which has the following script:
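<?php
// Read the file whose path was passed as the first command-line argument.
$fileContents = file_get_contents($argv[1]);
// Remove the style attribute from the <body> tag, keeping everything around it.
$fileContents = preg_replace('/(<body[^>]*) style="[^"]*"([^>]*>)/', '$1$2', $fileContents);
// Overwrite the original file with the updated contents.
file_put_contents($argv[1], $fileContents);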
This script reads the file content of the file at hand (file path is provided as $argv[1]), does
the search-and-replace, and writes out the updated file contents.
I also created the equivalent Python version in a file deleteBodyStyle.py:
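import sys
import re

# Read the file whose path was passed as the first command-line argument.
with open(sys.argv[1], 'r') as inFile:
    data = inFile.read()
# Remove the style attribute from the <body> tag, keeping everything around it.
data = re.sub(r'(<body[^>]*) style="[^"]*"([^>]*>)', r'\1\2', data)
# Overwrite the original file with the updated contents.
with open(sys.argv[1], 'w') as outFile:
    outFile.write(data)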
We can now test these scripts. I will be using a sample EPUB made by means of Adobe
InDesign 2020 from an InDesign sample file called Adobe History.indd that came with
InDesign CS3. It is an excerpt from the book Inside the Publishing Revolution: The Adobe
Story by Pamela Pfiffner.
Before looking at DropToScript, I'll first use the scripts in a manual fashion. That's slightly
cumbersome, but we need to go through it to get a good understanding of what is going on.
I crack open the EPUB exported from InDesign, and then I'll use Terminal on Mac (or Cygwin
Terminal on Windows) to execute the scripts from the command line, using drag/drop to
avoid having to type the path to the xhtml files.
cd Desktop
php deleteBodyStyle.php /Users/kris/Desktop/Adobe\ History/OEBPS/Adobe_History-6.xhtml
python deleteBodyStyle.py /Users/kris/Desktop/Adobe\ History/OEBPS/Adobe_History-7.xhtml
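To execute either of these scripts on all .xhtml files, we can use some command-line magic:
dirToScan="/Users/kris/Desktop/Adobe History/OEBPS/"; ls "$dirToScan"*.xhtml | while read fileToScan; do php deleteBodyStyle.php "$fileToScan"; done
or
dirToScan="/Users/kris/Desktop/Adobe History/OEBPS/"; ls "$dirToScan"*.xhtml | while read fileToScan; do python deleteBodyStyle.py "$fileToScan"; done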
After adjusting the CSS file, we can recompress the EPUB.
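The CSS adjustment itself can also be helped along by a script. As a rough sketch, the Python function below pulls the style attribute out of the <body> tag and formats it as a CSS rule that could be pasted into the stylesheet. The function name body_style_as_css is made up, and splitting on ';' is a simplification that assumes no semicolons occur inside the property values.

```python
import re


def body_style_as_css(xhtml):
    """Return the <body> style attribute as a CSS rule, or None if absent."""
    match = re.search(r'<body[^>]*\sstyle="([^"]*)"', xhtml)
    if match is None:
        return None
    # Split the inline style into individual declarations.
    declarations = [d.strip() for d in match.group(1).split(";") if d.strip()]
    lines = ["body {"] + ["    " + d + ";" for d in declarations] + ["}"]
    return "\n".join(lines)


sample = '<body id="x" style="margin:0; color: black">'
print(body_style_as_css(sample))
```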
I've not done any error handling. It would be cleaner to also add additional checks (e.g.
verify the file name extension of the file being processed) and error checks (e.g. report any
unexpected circumstances), but for most intents and purposes the above script will work
fine.
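For comparison, here is a sketch of what the Python script could look like with the checks mentioned above added. It is one possible approach, not the version used in the presentation. It assumes the files are UTF-8 encoded, and the function name delete_body_style is invented for the example.

```python
import re
import sys

BODY_STYLE = re.compile(r'(<body[^>]*) style="[^"]*"([^>]*>)')


def delete_body_style(path):
    """Remove the <body> style attribute in place; return True if changed."""
    # Check the file name extension before touching the file.
    if not path.lower().endswith((".xhtml", ".html")):
        raise ValueError("not an (X)HTML file: " + path)
    # newline="" keeps the original line endings untouched (assumes UTF-8).
    with open(path, "r", encoding="utf-8", newline="") as in_file:
        data = in_file.read()
    new_data, count = BODY_STYLE.subn(r"\1\2", data)
    if count == 0:
        # Nothing to remove: leave the file alone.
        return False
    with open(path, "w", encoding="utf-8", newline="") as out_file:
        out_file.write(new_data)
    return True


if __name__ == "__main__" and len(sys.argv) > 1:
    try:
        changed = delete_body_style(sys.argv[1])
        print(("updated: " if changed else "unchanged: ") + sys.argv[1])
    except (OSError, ValueError) as error:
        # Report unreadable files, wrong extensions, or bad encodings.
        sys.exit("error: " + str(error))
```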
DropToScript Script Wrapper
Once you have a script, you'll often find that you're going through the same motions over
and over:
• Decompress EPUB
• Run the script on a bunch of files (e.g. html files or css files).
• Repackage EPUB
DropToScript manages the decompress/repackage part automatically. Once you have a
script (in PHP, Python, Node JavaScript...) that can process a single file at a time, you can
configure DropToScript to automatically perform the same script on many files, simply by
dragging an EPUB or a collection of file icons onto the DropToScript icon.
DropToScript comes bundled with a number of pre-made useful scripts, but you can easily
add your own.
After downloading it, you need to configure it so it can find the PHP or Python installation
on your computer. You do this by double-clicking the icon of the application.
As an example, you can copy either the deleteBodyStyle.php or deleteBodyStyle.py file into
the DropScripts folder and then drag-drop any EPUB onto DropToScript to have
deleteBodyStyle executed on the text files inside the EPUB.
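To make the decompress / run script / repackage cycle concrete, here is a minimal Python sketch that does all three steps in one pass, without writing the individual files to disk. It is not DropToScript's actual implementation. The function names are hypothetical, and the sketch assumes the .xhtml files are UTF-8 encoded.

```python
import re
import zipfile


def process_epub(src_path, dst_path, transform):
    """Copy an EPUB, applying transform(text) to every .xhtml entry."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w") as dst:
        # Entries are copied in their original order, so mimetype stays first.
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.lower().endswith(".xhtml"):
                data = transform(data.decode("utf-8")).encode("utf-8")
            # The mimetype entry must remain uncompressed.
            if info.filename == "mimetype":
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            dst.writestr(info.filename, data, compress_type=method)


def delete_body_style(text):
    """Same search-and-replace as in deleteBodyStyle.py, applied to a string."""
    return re.sub(r'(<body[^>]*) style="[^"]*"([^>]*>)', r'\1\2', text)
```

A call such as process_epub("in.epub", "out.epub", delete_body_style) would then write a cleaned copy next to the original; both file names are placeholders.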
Stuff I use
cd_to (Mac): https://github.com/jbtule/cdto
Cygwin: https://www.cygwin.com/
Atom Text Editor: https://atom.io/
Notepad++ Text Editor: https://notepad-plus-plus.org/
eCanCrusher: https://www.docdataflow.com/ecancrusher/
DropToScript: https://github.com/BCLibCoop/nnels-a11y-publishing/tree/kris-enhancements-20200318/ReleaseVersions
Guitar Jigs: https://www.youtube.com/results?search_query=Dan+Erlewine+jig
Inside the Publishing Revolution: The Adobe Story:
https://www.amazon.com/Inside-Publishing-Revolution-Adobe-Story/dp/0321115643