Demystifying Containers and Container Images

Dan Čermák

Follow Along

dcermak.github.io/container-images

`who -u`

Dan Čermák

	Software Developer @SUSE
	i3 SIG, Package maintainer
	Developer Tools, Testing and Documentation, Home Automation
	https://dancermak.name
	dcermak
	@Defolos@mastodon.social
	@defolos.bsky.social

Software Delivery: The Real Problem

Dev environment != Production environment
Deploy ⇒ 2 days of debugging 😡

Why can't we just ship the exact environment that works?

So you want a VM?

No…

Attempt #1: Bundle Everything 📦️

Create dev: rsync -avz --exclude=/srv/dev/ / /srv/dev/
Install app: chroot /srv/dev make install
Clean up: chroot /srv/dev make clean
Package: tar -czf app.tar.gz -C /srv dev

And deploy 🚀

tar -xzf app.tar.gz -C /opt/
chroot /opt/dev/ /usr/local/bin/app.bin

🎉 Success?

Manual process
Huge tar files (entire OS + app) are unwieldy
No process isolation
No resource limits
No network isolation

Attempt #2: Add Process Isolation (Linux Namespaces)

chroot only isolates the filesystem, not processes or the network: the deployed app still sees every PID and has full network access
Namespaces provide deeper isolation, which is exactly what containers need
mount namespaces were introduced in 2002 (kernel 2.4.19), more namespace types were added from 2006 on
container support was completed in 2013 with the user namespace in kernel 3.8
user: separate user ids of namespace & host, map uids between host & namespace ⇒ uid 0 in the namespace can be the user who created the namespace (→ see also /etc/subuid)
mnt: mount namespace, isolated mounts
pid: process ID isolation, the first process in the namespace gets PID 1 and all other processes in it become its descendants (also those of sub-namespaces)
net: isolated network stack, each network interface belongs to exactly one namespace
ipc: isolates SysV-style IPC and POSIX message queues
uts (unix time sharing): per-namespace hostname & domainname
cgroup (added in 4.6): hides the cgroup path, i.e. a process only sees cgroup paths relative to the root of its namespace and no others
time (added in 5.6): per-namespace offsets for the monotonic and boot-time clocks (the wall clock stays shared)
useful tool: lsns
namespaces can be nested and are inherited by child processes

Linux Namespaces provide kernel-level resource isolation

user
mnt
pid
net
ipc
uts
cgroup
time

$ unshare --user --map-root-user \
      --pid --fork --mount-proc \
      /bin/bash
# whoami
root
# ps -a
    PID TTY          TIME CMD
      1 pts/8    00:00:00 bash
    104 pts/8    00:00:00 ps
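
The namespaces a process belongs to are visible as symlinks under /proc/<pid>/ns, with targets such as pid:[4026531836]. A minimal sketch of reading them, assuming a Linux /proc layout; the function names list_namespaces and same_namespace are made up for this example:

```python
import os


def list_namespaces(ns_dir="/proc/self/ns"):
    """Return {entry name: namespace inode} for the symlinks in ns_dir."""
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        path = os.path.join(ns_dir, name)
        if not os.path.islink(path):
            continue
        target = os.readlink(path)  # looks like "pid:[4026531836]"
        _kind, _, rest = target.partition(":")
        if rest.startswith("[") and rest.endswith("]"):
            result[name] = int(rest[1:-1])
    return result


def same_namespace(a, b, kind):
    """True if both mappings carry the same inode for the given entry."""
    return kind in a and a.get(kind) == b.get(kind)


# Only print when a Linux-style /proc is available.
if __name__ == "__main__" and os.path.isdir("/proc/self/ns"):
    for kind, inode in list_namespaces().items():
        print(f"{kind:20} {inode}")
```

Two processes share a namespace exactly when the inode numbers match, so running this inside and outside the unshare shell above should show different values for user, pid and mnt.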

Attempt #3: Add Resource Limits (cgroups)

cgroups (Control Groups) provide resource management:

Apply resource limits to processes
Measure resource usage

# cgcreate -g memory:memlimit
# cgset -r memory.max=1K memlimit
# cgexec -g memory:memlimit ls -al
Killed
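
The "measure resource usage" half can be sketched too. With cgroup v2, every cgroup directory exposes plain-text files such as memory.max (the limit, or the literal string max when unlimited) and memory.current (the usage in bytes). The sketch below only reads those two files; the helper names are invented here, and the path in the usage note assumes the unified hierarchy is mounted at /sys/fs/cgroup.

```python
import os


def read_cgroup_value(cgroup_dir, filename):
    """Read a single-value cgroup v2 file; 'max' (unlimited) becomes None."""
    with open(os.path.join(cgroup_dir, filename)) as fh:
        text = fh.read().strip()
    return None if text == "max" else int(text)


def memory_headroom(cgroup_dir):
    """Bytes left before the memory limit is hit, or None if unlimited."""
    limit = read_cgroup_value(cgroup_dir, "memory.max")
    used = read_cgroup_value(cgroup_dir, "memory.current")
    if limit is None:
        return None
    return limit - used
```

For the group created above this could be called as memory_headroom("/sys/fs/cgroup/memlimit"), provided the cgroup lives at that path on the system at hand.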

The Manual Approach Doesn't Scale

We have:

✔️ Filesystem isolation (chroot)
✔️ Process isolation (namespaces)
✔️ Resource limits (cgroups)

We need:

❌ Standardized container build process
❌ Easy sharing and distribution
❌ Automated namespace/cgroup/FS setup
❌ Simple command-line interface

Introducing: Docker

Standardized build process → Dockerfile
Easy sharing/distribution → Docker Registry
Automated setup → docker run
Simple interface → docker CLI

Container Image Build

FROM registry.opensuse.org/opensuse/tumbleweed
RUN zypper -n in python3
COPY . /src/
RUN pip install .
RUN make test

and we need some copy-on-write (CoW)

UnionFS

mount -t overlay overlay \
      -o lowerdir=lower_3:lower_2:lower_1,upperdir=upper,workdir=/work/ \
      merged
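
How the merged view behaves can be illustrated with a toy in-memory model. This is not how overlayfs is implemented, only a sketch of its lookup rules: the upper layer wins, then the lower layers from left to right as listed in lowerdir=, and all writes and deletions land in the upper layer. The class OverlayModel and the WHITEOUT sentinel are inventions for this example; real overlayfs records deletions as whiteout entries in upperdir.

```python
WHITEOUT = object()  # stands in for a whiteout entry in the upper layer


class OverlayModel:
    def __init__(self, lowers, upper=None):
        # lowers is ordered like lowerdir=: first entry is the top-most lower layer
        self.lowers = list(lowers)
        self.upper = {} if upper is None else upper

    def read(self, path):
        # Search the upper layer first, then each lower layer top to bottom.
        for layer in [self.upper, *self.lowers]:
            if path in layer:
                value = layer[path]
                if value is WHITEOUT:
                    raise FileNotFoundError(path)
                return value
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Copy-on-write: lower layers are never modified.
        self.upper[path] = data

    def delete(self, path):
        self.read(path)  # raises FileNotFoundError if the path is not visible
        self.upper[path] = WHITEOUT
```

Each RUN or COPY step of an image build corresponds to one such layer: the previous layers stay read-only and only the differences are stored.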

Container Image Build

docker build .

FROM registry.opensuse.org/opensuse/tumbleweed

COPY . /src/
WORKDIR /src/

RUN zypper -n in python3-pip; \
    pip install . ; \
    zypper -n rm --clean-deps gcc; zypper -n clean; \
    rm -rf {/target,}/var/log/{alternatives.log,lastlog,tallylog,zypper.log,zypp/history,YaST2}

EXPOSE 80
CMD ["/usr/bin/python", "-m", "my-app"]

Dockerfile

FROM: specifies the base image for the current build stage
COPY: copies files from the build context (the directory passed as the last CLI argument) or from another stage into the current stage. ADD used to cover this use case, but is discouraged nowadays
ENV: sets environment variables, valid for the rest of the build stage & in the final image
RUN: executes arbitrary commands in the container image context, using the default shell. Beware of shell escapes when creating multiline strings; people often resort to hacks like ksh93 ANSI-C quoting. Also supports flags, e.g. for mounting secrets or setting the network
VOLUME: declares a directory as a volume; changes made to it in later build steps are discarded, and a temporary volume is created for it when the container is launched
WORKDIR: sets the cwd for all subsequent instructions & for entrypoint/cmd
EXPOSE: declares the network ports the container listens on, but this is only documentation. A protocol can be specified and defaults to TCP. Ports still have to be published via -p $hostPort:$ctrPort or all at once via -P
USER: defines the user for entrypoint & cmd and subsequent RUN instructions, must exist in the image!
CMD: default arguments for the entrypoint
ENTRYPOINT: defines the binary launched as PID 1

additional directives:

ARG: sets build arguments, can be passed via the --build-arg "USER=me" CLI flag
LABEL: add key-value metadata to the image, common ones: https://github.com/opencontainers/image-spec/blob/main/annotations.md
SHELL: sets the shell, defaults to ["/bin/sh", "-c"]
STOPSIGNAL: which signal should be sent to PID 1 on docker stop (defaults to SIGTERM)

non-standard (Docker-specific, not part of the OCI image format):

HEALTHCHECK: command to check whether application in container is up
ONBUILD: commands executed when using this image for building

FROM registry.opensuse.org/opensuse/tumbleweed
COPY ./project/ /src/
ENV USER="geeko"
RUN zypper -n in openssh-clients; \
    ssh-keygen -t ed25519 -f /root/.ssh/id_ed25519 -N ""; \
    zypper -n rm --clean-deps openssh-clients; \
    zypper -n clean; rm -rf /var/log/lastlog;
VOLUME ["/src/data"]
WORKDIR /src/
EXPOSE 22
RUN useradd $USER
USER $USER
CMD ["echo hello"]
ENTRYPOINT ["/bin/bash", "-ce"]
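
How ENTRYPOINT and CMD combine into the command line of PID 1 can be sketched as a small function. It covers only the exec (JSON array) form of both directives, and the name effective_argv is made up for this example.

```python
def effective_argv(entrypoint, cmd, cli_args=None):
    """Compute the argv of PID 1 for exec-form ENTRYPOINT and CMD.

    Arguments given after the image name on the command line replace CMD;
    the ENTRYPOINT stays in place.
    """
    args = list(cli_args) if cli_args else list(cmd)
    return list(entrypoint) + args
```

For the Dockerfile above, effective_argv(["/bin/bash", "-ce"], ["echo hello"]) yields ["/bin/bash", "-ce", "echo hello"], i.e. bash runs the string from CMD as its command.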

Docker Registry

docker pull registry.opensuse.org/opensuse/leap
docker pull registry.opensuse.org/opensuse/leap:15.6
docker pull registry.opensuse.org/opensuse/leap:15.5@sha256:a5ecb8286a6a1b695acb17e63f2702be29f2a72615ec10cfb4e427e2ebc9e8ad
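
The three pulls above differ only in how precisely they pin the image: no tag (which falls back to latest), a tag, and a tag plus a content digest. A simplified sketch of splitting such a reference into its parts follows; it is not the parser used by Docker or Podman, it skips validation entirely, and the names ImageRef and parse_reference are invented here.

```python
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    registry: Optional[str]
    repository: str
    tag: Optional[str]
    digest: Optional[str]


def parse_reference(ref: str) -> ImageRef:
    # Everything after "@" is the digest, e.g. "sha256:...".
    name, _, digest = ref.partition("@")
    tag = None
    # A ":" only introduces a tag if it comes after the last "/",
    # otherwise it belongs to a registry port such as "localhost:5000".
    colon = name.rfind(":")
    if colon > name.rfind("/"):
        name, tag = name[:colon], name[colon + 1:]
    # The first path component is a registry if it looks like a host name.
    first, sep, rest = name.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        registry, repository = first, rest
    else:
        registry, repository = None, name
    if tag is None and not digest:
        tag = "latest"
    return ImageRef(registry, repository, tag, digest or None)
```

A tag can be moved to a different image at any time, while a digest identifies one exact image, which is why the third form is the reproducible one.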

Volumes

docker run -v /vol/:/var/db/ -v logs:/var/log $img
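
The two -v arguments above are different kinds of mounts: /vol/ is a host directory (bind mount), logs is a named volume managed by the container engine. A rough sketch of telling them apart, assuming only the source:target[:options] forms and the rule that a source starting with / is a host path; parse_volume is a made-up helper, not part of any CLI.

```python
def parse_volume(spec):
    """Split a -v argument into (kind, source, target, options)."""
    parts = spec.split(":")
    if len(parts) == 2:
        source, target = parts
        options = ""
    elif len(parts) == 3:
        source, target, options = parts
    else:
        raise ValueError(f"unsupported volume spec: {spec!r}")
    # Absolute host paths are bind mounts, anything else is a volume name.
    kind = "bind" if source.startswith("/") else "volume"
    return kind, source, target, options
```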

Entrypoint

the entrypoint is launched as PID 1 in the PID namespace by the OCI runtime ⇒ every other process in that namespace becomes its descendant ⇒ it must forward signals to its children & reap them

This is why containers are not mini-VMs!

entrypoint should not be a shell ⇒ use the exec form and not the shell form to define the ENTRYPOINT, i.e.: ENTRYPOINT ["/bin/foo", "arg"]
entrypoint gets passed CMD as args by default
entrypoint should handle custom args, e.g. to launch a shell on request
a wrapper entrypoint should then exec the actual container process, not just launch it as a subprocess (which messes up signal handling)
sign that signal handling is messed up: WARN[0010] StopSignal SIGTERM failed to stop container $FOO in 10 seconds, resorting to SIGKILL
preferably don't run a full init like systemd (hardly doable with docker)
general scheme: support configuration via environment variables
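
The points above can be combined into a small wrapper entrypoint. This is only a sketch under assumptions: /usr/bin/my-app and the --db-port flag are hypothetical, POSTGRES_PORT is borrowed as an example variable, and the image is assumed to define no CMD, so that any arguments come from the user.

```python
import os
import sys

DEFAULT_APP = "/usr/bin/my-app"  # hypothetical application binary


def choose_command(cli_args, environ):
    """Pick the command that should become PID 1."""
    if cli_args:
        # Custom args win, e.g. "bash" to get a shell for debugging.
        return list(cli_args)
    # Otherwise configure the app from environment variables.
    port = environ.get("POSTGRES_PORT", "5432")
    return [DEFAULT_APP, "--db-port", port]  # --db-port is a made-up flag


def main():
    argv = choose_command(sys.argv[1:], os.environ)
    # exec replaces this wrapper, so the app itself runs as PID 1 and
    # receives SIGTERM directly instead of relying on signal forwarding.
    os.execvp(argv[0], argv)
```

Calling main() as the last line of the script turns it into the entrypoint. Because of the exec, no wrapper process is left behind that would have to forward signals or reap children.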

Networking

Best Practices

RUN zypper -n in python3-pip; \
    pip install . ; \
    zypper -n rm --clean-deps gcc; zypper -n clean; \
    rm -rf {/target,}/var/log/{alternatives.log,lastlog,tallylog,zypper.log,zypp/history,YaST2}

$ podman run -e POSTGRES_PORT=1234 \
             -e POSTGRES_USER=pg \
                 my-app
$ podman run -it my-app bash
#

or:

$ podman run my-app
#

Volumes are your friend:

VOLUME ["/var/db/"]
# changes to /var/db/ are now discarded after each build step!

use the exec-form:

ENTRYPOINT ["/usr/bin/my-app", "-param", "value"]

Podman

Actually Docker

Podman

Rootless Containers

container runs as non-root or a sub-uid of your user
rootless networking runs in userspace
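
The uid mapping behind this can be written down explicitly. The sketch below assumes the common rootless default: uid 0 in the container maps to the invoking user, and container uids from 1 upwards map onto that user's range from /etc/subuid (lines of the form name:start:count). Both function names and the user geeko with uid 1000 are illustrative.

```python
def parse_subuid(text, user):
    """Return (start, count) of the first /etc/subuid range for user."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, start, count = line.split(":")
        if name == user:
            return int(start), int(count)
    raise KeyError(user)


def container_to_host_uid(container_uid, own_uid, sub_start, sub_count):
    """Map a uid inside the container to the uid seen on the host."""
    if container_uid == 0:
        return own_uid  # "root" in the container is just the invoking user
    if 1 <= container_uid <= sub_count:
        return sub_start + container_uid - 1
    raise ValueError(f"uid {container_uid} is not mapped")
```

With the entry geeko:100000:65536 and a host uid of 1000, a process running as uid 1 in the container shows up as uid 100000 on the host, which is why files it creates in a bind mount belong to an unfamiliar uid.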

Security

container potentially as privileged as the user running it
container breakout attacks exist
SELinux is your friend

When to use Containers

Single-process applications
"Works on my machine" problems
Cloud/OS independent deployment
Reproducible environments

When NOT to Use Containers

High-performance I/O applications
Legacy multi-process applications
Desktop GUI applications

Container Orchestration

docker-compose

services:
  app:
    build: .
    ports:
      - "8080:8080"
    volumes:
      - .:/src
    depends_on:
      db:
        condition: service_healthy
  db:
    image: registry.opensuse.org/opensuse/mariadb
    environment:
      - MARIADB_ALLOW_EMPTY_ROOT_PASSWORD=1

docker compose up
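
The depends_on entries form a dependency graph that determines the start order. A minimal sketch of deriving that order with Python's graphlib (3.9 or newer); it models only the ordering, not the waiting for service_healthy, and start_order is a made-up helper rather than anything from compose itself.

```python
from graphlib import TopologicalSorter


def start_order(services):
    """Return service names so that dependencies come before dependents."""
    graph = {}
    for name, spec in services.items():
        # depends_on may be a list of names or a mapping name -> options;
        # set() keeps just the names in both cases.
        graph[name] = set(spec.get("depends_on", []))
    return list(TopologicalSorter(graph).static_order())
```

For the file above this gives db before app. A dependency cycle makes static_order() raise graphlib.CycleError.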

Quadlet / `podman generate systemd`

[Unit]
Description=TW container

[Container]
Image=registry.opensuse.org/opensuse/tumbleweed

# volume and network defined below in other configs
Volume=test.volume:/data
Network=test.network

Exec=sleep infinity

[Service]
Restart=always
TimeoutStartSec=900

[Install]
# Start by default on boot
WantedBy=multi-user.target default.target

Kubernetes

inspired by Google's internal cluster manager "Borg"
open sourced in 2014, donated to the CNCF in 2015
declarative configuration via kubernetes yaml
self healing & (auto) horizontal scaling
for microservice architecture (i.e. each container single app)
quickly became the industry standard, kubernetes yaml is nowadays supported by podman
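
"Declarative" and "self healing" both come down to a control loop that compares the desired state with the observed one and acts on the difference. The following is a toy model of one such reconciliation step, not Kubernetes code; the pod dictionaries and the action tuples are invented for illustration.

```python
def reconcile(desired_replicas, running_pods):
    """Return the actions needed to reach the desired number of healthy pods."""
    healthy = [p for p in running_pods if p["healthy"]]
    # Unhealthy pods are always replaced.
    actions = [("delete", p["name"]) for p in running_pods if not p["healthy"]]
    missing = desired_replicas - len(healthy)
    if missing > 0:
        actions += [("create", None)] * missing
    elif missing < 0:
        surplus = -missing
        actions += [("delete", p["name"]) for p in healthy[:surplus]]
    return actions
```

Run repeatedly, such a loop converges on the declared state no matter how the cluster drifted away from it, which is the core idea behind scaling and self healing.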

architecture:

Control Plane (master components):
- API Server: Front-end for the Kubernetes control plane
- etcd: Consistent and highly-available key-value store for all cluster data
- Scheduler: Assigns workloads to nodes
- Controller Manager: Runs controller processes
- Cloud Controller Manager: Integrates with cloud provider APIs
Node Components:
- Kubelet: Ensures containers are running in a pod
- Container Runtime: Software responsible for running containers (Docker, containerd, CRI-O)
- Kube-proxy: Network proxy that maintains network rules on nodes

Key Concepts:

Pods: Smallest deployable units, containing one or more containers
Services: Abstraction that defines a logical set of pods and a policy to access them
Deployments: Manage the deployment and scaling of a set of pods
ConfigMaps/Secrets: Ways to inject configuration into applications
Namespaces: Virtual clusters within a physical cluster
Persistent Volumes: Storage abstraction that outlives pod lifecycle

Common Patterns:

Sidecar: Helper containers that enhance the main container
Ambassador: Proxy local connections to external services
Adapter: Standardizes and normalizes output of the main container
Init Containers: Run before app containers, setting up dependencies
StatefulSets: For stateful applications requiring stable network identifiers and persistent storage
DaemonSets: Ensure that all nodes run a copy of a specific pod
Jobs/CronJobs: Run-to-completion and scheduled tasks

Kubernetes yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-application
  labels:
    app: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web-container
        image: nginx:latest
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: "0.5"
            memory: "512Mi"
          requests:
            cpu: "0.2"
            memory: "256Mi"
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
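
The resource values in this manifest are Kubernetes quantities: CPU is counted in cores or millicores ("0.5" equals "500m"), memory uses binary suffixes such as Mi (2^20 bytes) or decimal ones such as M. A simplified sketch of converting them; it handles only the suffixes listed and the helper names are made up.

```python
from decimal import Decimal

_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def cpu_to_millicores(quantity):
    """Convert a CPU quantity such as "0.5" or "500m" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_to_bytes(quantity):
    """Convert a memory quantity such as "512Mi" or "1G" to bytes."""
    # Binary suffixes are checked first so "Mi" is not mistaken for "M".
    for suffix, factor in {**_BINARY, **_DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))
```

Such helpers make it easy to check that the requests of a container do not exceed its limits before a manifest is applied.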

Should I use Containers?

It depends

Ask yourself:

Do I have environment consistency problems?
Is my app a single process that I can isolate?
Do I need to share/distribute my app environment?
Can I separate data & code in deployment?
Am I willing to learn new deployment patterns?

Questions?

dcermak.github.io/container-images
