Build optimization mechanisms in GitLab and Docker

by Dmytro Patkovskyi

About me
Dmytro Patkovskyi
Software Engineer at Grammarly (Core Services team, Java / backend)
Past experience: Amazon, Ciklum, Grammatica.eu

Our CI/CD infrastructure (2019)
● GitLab Enterprise (in-house, AWS): 300+ repositories in 19 groups; the number of runners is set per group
● JFrog Artifactory (in-house, AWS): Docker registry and artifact storage
● AWS ECS (on EC2 instances): deployments

Build goals
● Reproducibility
● Simplicity
● Speed
    Clean build speed
    Incremental build speed (incrementality)

Docker (version 18.09)
Build optimizations available:
● Layer cache
… and that’s it!
But is it used in your CI builds?
No, if you’re asking this question.

How to enable layer cache on CI (GitLab)
Change this:
script:
  - docker build
      --build-arg VERSION=$VERSION
      --tag $IMAGE:$VERSION .
  - docker push $IMAGE:$VERSION
To this:
script:
  # Pull the previous cache image; ignore the failure on the very first build
  - docker pull $IMAGE:cache || true
  - docker build
      --build-arg VERSION=$VERSION
      --tag $IMAGE:$VERSION
      --tag $IMAGE:cache
      --cache-from $IMAGE:cache .
  - docker push $IMAGE:$VERSION
  - docker push $IMAGE:cache

Layer cache in a few words
Key: current image + current instruction.
Value: next layer.
Invalidation: a cache miss on one instruction invalidates the cache for all instructions below it.
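The key/value/invalidation rule can be sketched as a toy model. This is only an illustration, not how Docker is implemented: the function name `build` is made up, and each instruction string stands for "instruction plus the checksum of any files it copies".

```python
import hashlib

def build(instructions, cache):
    """Return (image_id, executed) after 'building' the given instructions.

    cache maps (current image id, instruction) -> next layer id,
    mirroring "key: current image + current instruction, value: next layer".
    """
    image = "scratch"
    executed = []
    for instruction in instructions:
        key = (image, instruction)
        if key not in cache:
            # Cache miss: "run" the instruction and record the new layer.
            executed.append(instruction)
            cache[key] = hashlib.sha256(repr(key).encode()).hexdigest()[:12]
        # A new layer id changes the key of every instruction below it,
        # which is why one miss invalidates the rest of the build.
        image = cache[key]
    return image, executed

cache = {}
_, first = build(["FROM base", "COPY deps", "RUN install", "COPY app-v1"], cache)
_, same = build(["FROM base", "COPY deps", "RUN install", "COPY app-v1"], cache)
_, changed = build(["FROM base", "COPY deps-v2", "RUN install", "COPY app-v1"], cache)
print(first)    # all four instructions run on a cold cache
print(same)     # [] because every layer is reused
print(changed)  # ['COPY deps-v2', 'RUN install', 'COPY app-v1']
```

Note that `RUN install` and `COPY app-v1` are re-executed in the last build even though their text did not change: their parent image did.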

Why layer cache matters
● Pull speed
● Build speed
● Push speed
● Storage space
Spend less money and time!

Optimal layer structure
First layer: changes rarely
...
Last layer: changes most frequently
From the first layer to the last, change frequency increases while size & build time decrease.

Proper instruction order
Inefficient order:
# changes on every commit
ARG VERSION
# rarely changes
COPY nodeserver /opt/nodeserver
# changes on every commit
ADD /distributions/project-$VERSION.tar /opt
# takes 60s and rarely changes
RUN cd /opt/nodeserver && ./install.sh
Efficient order:
COPY nodeserver /opt/nodeserver
RUN cd /opt/nodeserver && ./install.sh
ARG VERSION
ADD /distributions/project-$VERSION.tar /opt
Result: 60s saved on each build
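The saving can be checked with a small cost model. It is a sketch under assumptions: only the 60 s install step comes from the slide, the other timings are placeholders, and treating `ARG VERSION` as "changed" at its definition is a simplification of Docker's actual behaviour.

```python
def rebuild_seconds(steps, changed):
    """Seconds spent re-running steps from the first changed one onward.

    steps: list of (name, seconds); changed: names whose inputs changed.
    Every step after the first cache miss also misses the cache.
    """
    for i, (name, _) in enumerate(steps):
        if name in changed:
            return sum(seconds for _, seconds in steps[i:])
    return 0

# Placeholder timings, except the 60 s install step taken from the slide.
inefficient = [("ARG VERSION", 0), ("COPY nodeserver", 0),
               ("ADD project tar", 1), ("RUN install.sh", 60)]
efficient = [("COPY nodeserver", 0), ("RUN install.sh", 60),
             ("ARG VERSION", 0), ("ADD project tar", 1)]
per_commit = {"ARG VERSION", "ADD project tar"}

slow = rebuild_seconds(inefficient, per_commit)
fast = rebuild_seconds(efficient, per_commit)
print(slow, fast, slow - fast)  # 61 1 60
```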

Chain install & cleanup commands
Why?
Files created in one layer keep increasing the image size even if you delete them in a later layer.
How?
Instead of:
RUN apt-get update && apt-get install ...
RUN rm -rf /var/lib/apt/lists/*
write:
RUN apt-get update && apt-get install … && rm -rf /var/lib/apt/lists/*
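A rough way to see why chaining helps is to treat the image size as the sum of its layer sizes. The sketch below is a deliberately simplified model with hypothetical sizes (50 MB of packages, 40 MB of apt lists), not a measurement.

```python
def image_size_mb(layers):
    """Image size as the sum of what each layer adds.

    Each layer is (added_mb, deleted_mb). Deleting files that came from an
    earlier layer only hides them; the earlier layer keeps its full size,
    so deleted_mb never reduces the total.
    """
    return sum(added for added, _deleted in layers)

# Hypothetical sizes: packages 50 MB, apt lists 40 MB.
separate = [(90, 0), (0, 40)]  # RUN install, then RUN rm in its own layer
chained = [(50, 0)]            # RUN install && rm: lists are never committed

print(image_size_mb(separate))  # 90
print(image_size_mb(chained))   # 50
```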

Use .dockerignore
Why?
To avoid cache invalidations in ADD/COPY due to irrelevant file changes.
Also, smaller Docker context => faster build start.
How?
Add file masks of all irrelevant files (e.g., readme, IDE files) to .dockerignore.
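The effect on cache invalidation can be illustrated by fingerprinting the build context with and without ignore patterns. This is a sketch: `context_fingerprint` is a made-up helper, the file names are examples, and `fnmatch` only approximates Docker's real .dockerignore matching rules.

```python
import fnmatch
import hashlib

def context_fingerprint(files, ignore_patterns):
    """Hash only the files that survive the ignore patterns.

    files: {relative path: content bytes}.
    """
    digest = hashlib.sha256()
    for path in sorted(files):
        if any(fnmatch.fnmatch(path, p) for p in ignore_patterns):
            continue  # ignored files never reach the build context
        digest.update(path.encode())
        digest.update(files[path])
    return digest.hexdigest()

ignore = ["README.md", ".idea/*", "*.iml"]
before = {"app.jar": b"v1", "README.md": b"old", ".idea/workspace.xml": b"a"}
after = {"app.jar": b"v1", "README.md": b"new", ".idea/workspace.xml": b"b"}

unchanged_with_ignore = (
    context_fingerprint(before, ignore) == context_fingerprint(after, ignore)
)
unchanged_without_ignore = (
    context_fingerprint(before, []) == context_fingerprint(after, [])
)
print(unchanged_with_ignore)     # True: a README edit does not touch the context
print(unchanged_without_ignore)  # False: the same edit would invalidate COPY . layers
```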

Separate code layer from dependency layer
Why?
Your compiled code is smaller and changes more often than your dependencies.
How?
For Java: don’t put fat jars in your image. Use Google's Jib plugin or manually
extract dependencies into a separate layer.
Our experience:
70 MB fat jar layer that changes on each commit (2 s ECS pull) =>
200 KB code layer that changes on each commit (100 ms ECS pull).

Multi-stage Docker build
Why?
To avoid any build-time clutter in the final image
(reduce size, optimize layer structure).
How?
Next slides.

Multi-stage Docker build: Dockerfile
FROM <build-time base image> as builder
# build & run tests
# ...
FROM <run-time base image>
COPY --from=builder <dependencies> ...
COPY --from=builder <compiled code> ...
ENTRYPOINT ...

How to enable layer cache for multi-stage Docker build on CI
Run this for the builder image:
  - docker pull $IMAGE:builder || true
  - docker build
      --tag $IMAGE:builder
      --cache-from $IMAGE:builder
      --target builder .
  - docker push $IMAGE:builder
… and then this for the final image:
  - docker pull $IMAGE:cache || true
  - docker build
      --tag $IMAGE:$VERSION
      --tag $IMAGE:cache
      --cache-from $IMAGE:builder
      --cache-from $IMAGE:cache .
  - docker push $IMAGE:$VERSION
  - docker push $IMAGE:cache

Alternative: multi-stage CI build
FROM <run-time image>
# <dependencies> and <compiled code> are artifacts from previous CI stage(s),
# built & tested separately
ADD <dependencies> ...
ADD <compiled code> ...
ENTRYPOINT ...

Multi-stage Docker build vs. multi-stage CI build
Multi-stage Docker build:
+ fits into any CI system
+ easy migration between CIs
+ easy to reproduce locally
- poor integration with CI
- hard to modularize
- long Dockerfiles
Multi-stage CI build: the exact opposite.

Docker checklist
● Use lightweight base images
● Check your instruction order
● Chain install & cleanup commands
● Use .dockerignore
● Split code layer from dependency layer
● Use multi-stage build (in Docker or CI)
Recommended tools: Dive, Jib (for Java projects)
https://github.com/wagoodman/dive
https://github.com/GoogleContainerTools/jib

GitLab (version 12.4.0-ee)
Optimization features:
● Artifacts
● Cache (local or shared)
● Persistent volumes

GitLab job executors
● SSH
● Shell
● VirtualBox
● Parallels
● Docker-machine
● Docker (used in Grammarly)
● Kubernetes
● Custom
covered in this talk

Syntax examples (.gitlab-ci.yml)
Artifacts:
artifacts:
  paths:
    - test-report
    - distributive
  expire_in: 1 week
  when: always
Cache:
cache:
  key: somekey
  paths:
    - .gradle
Persistent volume (/cache in this case):
script:
  - mv myfile /cache

GitLab concepts
[Diagram: each Git commit (c06c4c91, 31137606, ...) gets its own GitLab pipeline. A pipeline consists of stages (Stage 1, Stage 2, Stage 3), and each stage runs jobs (Job A to Job F).]
Cache and persistent volumes, unlike artifacts, can be passed between pipelines.

Use artifacts to avoid work duplication in jobs of a single pipeline

Anti-pattern #1: unintentional artifact downloads
By default, a job downloads all artifacts from all previous stages of a pipeline.
If your job doesn’t need any artifacts:
dependencies: []
If your job needs artifacts from jobs A and C:
dependencies:
  - A
  - C
Our experience: 70 MB jar × 3 jobs ≈ 15 seconds saved

Cache type choice
1. Shared
2. Local
The choice is made when setting up runners.
The cache: syntax in .gitlab-ci.yml remains the same.
Shared cache + local persistent volume = best of both worlds.

How shared cache works
It’s very simple:
1. Download & extract zip from S3 / GCS based on cache:key.
2. Execute job scripts.
3. Pack files under cache:paths into a new zip & upload.
…maybe too simple?
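The three steps can be sketched with a local directory standing in for the S3 / GCS bucket. This is an illustrative model only, not GitLab Runner code; `run_job_with_shared_cache` and the fake job script are made-up names.

```python
import tempfile
import zipfile
from pathlib import Path

def run_job_with_shared_cache(bucket, key, workdir, cache_paths, script):
    """Download and extract, run the job, then re-pack and upload the cache."""
    archive = bucket / f"{key}.zip"
    # 1. Download & extract the archive for this cache key, if one exists.
    if archive.exists():
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(workdir)
    # 2. Execute the job script.
    script(workdir)
    # 3. Pack all files under the cache paths into a new zip and "upload" it.
    #    The whole archive is rewritten, even if only one file changed.
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for rel in cache_paths:
            for file in sorted((workdir / rel).rglob("*")):
                if file.is_file():
                    zf.write(file, file.relative_to(workdir).as_posix())

def fake_job(workdir):
    """Pretend to download one more dependency into .gradle."""
    deps = workdir / ".gradle"
    deps.mkdir(exist_ok=True)
    count = len(list(deps.iterdir()))
    (deps / f"dep{count}.jar").write_text("binary")

with tempfile.TemporaryDirectory() as tmp:
    bucket = Path(tmp) / "bucket"
    bucket.mkdir()
    for build_number in range(2):
        workdir = Path(tmp) / f"job{build_number}"  # each job gets a clean checkout
        workdir.mkdir()
        run_job_with_shared_cache(bucket, "somekey", workdir, [".gradle"], fake_job)
    with zipfile.ZipFile(bucket / "somekey.zip") as zf:
        cached = sorted(zf.namelist())

print(cached)  # ['.gradle/dep0.jar', '.gradle/dep1.jar']
```

The second job re-uploads `dep0.jar` unchanged, which is the "not an rsync" behaviour in miniature.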

Shared cache gotchas
● Not an rsync
● Transfer never skipped
The whole cache.zip is downloaded and uploaded every time a job runs.
Minor:
● No automatic cleanup of unused files
● Absolute paths to cached files are different across runners unless
you set $GIT_CLONE_PATH to a runner-independent value
Persistent volume options
Docker executor:
● Dynamic storage
● Host-bound persistent storage
Kubernetes executor:
● Persistent volume claims (PVC)
● Host path volume

Shared cache trade-off: build time vs. transfer time

Local persistent volume trade-off: number of runners vs. cache freshness

Define “cache freshness”
Fresh: for build #5, only the cache produced by build #4 is fresh.
Stale: for build #5, the caches produced by builds #1, #2, and #3 are stale.
Freshness: for build #5, the cache from #3 is fresher than the one from #1.
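To see how freshness interacts with the number of runners, here is a small model of local (per-runner) caches. It assumes builds are assigned round-robin, which is a simplification of real scheduling, and `staleness_per_build` is a made-up helper.

```python
def staleness_per_build(num_builds, num_runners):
    """How many builds behind the runner's local cache is when each build starts.

    0 means fresh: the cache was produced by the immediately preceding build.
    None means the runner has no cache yet.
    """
    last_build_on_runner = {}
    result = []
    for build in range(1, num_builds + 1):
        runner = build % num_runners  # round-robin assignment (assumption)
        previous = last_build_on_runner.get(runner)
        result.append(None if previous is None else build - previous - 1)
        last_build_on_runner[runner] = build
    return result

one_runner = staleness_per_build(6, 1)
three_runners = staleness_per_build(6, 3)
print(one_runner)     # [None, 0, 0, 0, 0, 0]
print(three_runners)  # [None, None, None, 2, 2, 2]
```

With three runners, every build starts from a cache that is two builds old: more runners, less freshness.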

Shared cache vs. local persistent volume
Shared cache:
● Fresh on all runners.
● Bigger cache => longer transfer.
Local persistent volume:
● Fresh only on one runner. More runners => less freshness.
● No time penalty on size.

Anti-pattern #2: using shared cache for dependencies
cache:
  key: $CI_PROJECT_NAME
  paths:
    - .gradle/caches
Download + unzip + zip + upload time roughly cancels out the benefit of caching dependencies.
Use a local persistent volume to cache library dependencies.
Our experience: 500 MB cache ≈ 50 seconds saved
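The trade-off can be put into a back-of-the-envelope formula: net saving = time saved by having the dependencies cached, minus the time spent moving the cache. The numbers below are hypothetical (in particular the 20 MB/s effective throughput), chosen only to show how a large shared cache can break even.

```python
def net_saving_seconds(build_saving, cache_mb, mb_per_second, shared):
    """Seconds gained per job from caching, after paying transfer costs.

    A shared cache is downloaded and unpacked, then packed and uploaded,
    on every job; a local persistent volume has no transfer cost.
    """
    transfer = 2 * cache_mb / mb_per_second if shared else 0.0
    return build_saving - transfer

# Hypothetical: cached dependencies save 50 s of downloads, and the 500 MB
# archive moves at an effective 20 MB/s in each direction.
shared_gain = net_saving_seconds(50, 500, 20, shared=True)
local_gain = net_saving_seconds(50, 500, 20, shared=False)
print(shared_gain)  # 0.0
print(local_gain)   # 50.0
```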

So, when do you use each option?
Artifacts:
● Pass files between jobs of a single pipeline to avoid work repetition.
Shared cache:
● When fresh cache is required for speed-up.
● When cache is small.
Persistent volume:
● When stale cache also provides speed-up.
● When cache is big.

Special thanks!
Maryna Veremenko: help & support with GitLab
Dima Shevchuk: help & support with GitLab
Sasha Marynych: motivation :-)
