liminix/doc/admin.adoc

== System Administration

=== Services on a running system

Liminix services are built on s6-rc, which is itself layered on s6.
Services are defined at build time in your configuration (see
<<_services>> for information) and can't be added
to/changed at runtime, but to monitor events or diagnose problems you
may need to inspect them on the running system. Here are some of the
most commonly used s6,-rc commands:

.Service management quick reference
[width="100%",cols="55%,45%",options="header",]
|===
|What |How
|List all running services |`+s6-rc -a list+`

|List all services that are *not* running |`+s6-rc -da list+`

|List services that `+wombat+` depends on
|`+s6-rc-db dependencies wombat+`

|... transitively |`+s6-rc-db all-dependencies wombat+`

|List services that depend on service `+wombat+`
|`+s6-rc-db -d dependencies wombat+`

|... transitively |`+s6-rc-db -d all-dependencies wombat+`

|Stop service `+wombat+` and everything depending on it
|`+s6-rc -d change wombat+`

|Start service `+wombat+` (but not any services depending on it)
|`+s6-rc -u change wombat+`

|Start service `+wombat+` and all* services depending on it
|`+s6-rc-up-tree wombat+`
|===

`+s6-rc-up-tree+` brings up a service and all services that depend on
it, except for any services that depend on a "controlled" service that
is not currently running. Controlled services are not started at boot
time but in response to external events (e.g. plugging in a particular
piece of hardware) so you probably don't want to be starting them by
hand if the conditions aren't there.

A service may be *up* or *down* (there are no intermediate states like
"started" or "stopping" or "dying" or "cogitating"). Some (but not all)
services have "readiness" notifications: the dependents of a service
with a readiness notification won't be started until the service signals
(by writing to a nominated file descriptor) that it's prepared to start
work. Most services defined by Liminix also have a `+timeout-up+`
parameter, which means that if a service has readiness notifications and
doesn't become ready in the allotted time (defaults 20 seconds) it will
be terminated and its state set to *down*.

If the process providing a service dies, it will be restarted
automatically. Liminix does not automatically set it to *down*.

(If the process providing a service dies without ever notifying
readiness, Liminix will restart it as many times as it has to until the
timeout period elapses, and then stop it and mark it down.)

==== Controlled services

*Controlled* services are those which are started/stopped on demand by a
*controller* (another service) instead of being started at boot time.
For example:

* `+svc.uevent-rule.build+` creates a controlled service which is active
when a particular hardware device (identified by uevent/sysfs directory)
is present.
* `+svc.round-robin.build+` creates a service controller that invokes
two or more services in turn, running the next one when the process
providing the previous one exits. We use this for failover from one
network connection to a backup connection, for example.
* `+svc.health-check.build+` creates a service controller that runs a
controlled service and periodically tests whether it is healthy by
running an external health check command or script. If the check command
repeatedly fails, the controlled service is restarted.
+
The Configuration section of the manual describes controlled services in
more detail. Some operational considerations
* `+round-robin+` detects a service status by looking at its `+outputs+`
directory, so it won't work unless the service creates some outputs.
This is considered a bug and will be fixed in a future release
* `+health-check+` works for longruns but not for oneshots, as it
internally relies on `+s6-svc+` to restart the process

==== Logs

Logs for all services are collated into `+/run/log/current+`. The log
file is rotated when it reaches a threshold size, into another file in
the same directory whose name contains a TAI64 timestamp.

Each log line is prefixed with a TAI64 timestamp and the name of the
service, if it is a longrun. If it is a oneshot, a timestamp and the
name of some other service. To convert the timestamp into a
human-readable format, use `+s6-tai64nlocal+`.

[source,console]
----
# ls -l /run/log/
-rw-r--r--    1         0 lock
-rw-r--r--    1         0 state
-rwxr--r--    1     98059 @4000000000025cb629c311ac.s
-rwxr--r--    1     98061 @40000000000260f7309c7fb4.s
-rwxr--r--    1     98041 @40000000000265233a6cc0b6.s
-rwxr--r--    1     98019 @400000000002695d10c06929.s
-rwxr--r--    1     98064 @4000000000026d84189559e0.s
-rwxr--r--    1     98055 @40000000000271ce1e031d91.s
-rwxr--r--    1     98054 @400000000002760229733626.s
-rwxr--r--    1     98104 @4000000000027a2e3b6f4e12.s
-rwxr--r--    1     98023 @4000000000027e6f0ed24a6c.s
-rw-r--r--    1     42374 current

# tail -2 /run/log/current
@40000000000284f130747343 wan.link.pppoe Connect: ppp0 <--> /dev/pts/0
@40000000000284f230acc669 wan.link.pppoe sent [LCP ConfReq id=0x1 <asyncmap 0x0> <magic 0x667a9594> <pcomp> <accomp>]
# tail -2 /run/log/current  | s6-tai64nlocal
1970-01-02 21:51:45.828598156 wan.link.pppoe sent [LCP ConfReq id=0x1 <asyncmap 0x0> <magic 0x667a9594> <pcomp> <accom
p>]
1970-01-02 21:51:48.832588765 wan.link.pppoe sent [LCP ConfReq id=0x1 <asyncmap 0x0> <magic 0x667a9594> <pcomp> <accom
p>]
----

===== Log persistence

Logs written to `+/run/log/+` will not survive a reboot or crash, as it
is an ephemeral filesystem.

On supported hardware you can enable logging to
https://www.kernel.org/doc/Documentation/ABI/testing/pstore[pstore]
which means the most recent log messages will be preserved on reboot.
Set the config option `+logging.persistent.enable = true+` to log
messages to `+/dev/pmsg0+` as well as to the regular log. This is a
circular buffer, so when it fills up newer messages will overwrite the
oldest messages.

Logs found in pstore after a reboot will be moved at startup to
`+/run/log/previous-boot+`

=== Updating an installed system

If your system has a writable root filesystem (JFFS2, btrfs etc
-anything but squashfs), we have mechanisms for in-places updates
analogous to `+nixos-rebuild+`, but the operation is a bit different
because it expects to run on a build machine and then copy to the host
device using `+ssh+`.

To use this, build the `+outputs.updater+` target and then run the
`+update.sh+` script it generates.

[source,console]
----
nix-build -I liminix-config=./my-configuration.nix \
   --arg device "import ./devices/mydevice" \
   -A outputs.updater
./result/bin/update.sh root@the-device
----

The update script uses min-copy-closure to copy new or changed packages
to the device, then (perhaps) reboots it. The reboot behaviour can be
affected by flags:

* [.title-ref]#--no-reboot# will cause it not to reboot at all, if you
would rather do that yourself. Note that none of the newly-installed or
updated services will be running until you do.
* [.title-ref]#--fast# causes it tn not do a full reboot, but instead to
restart only the services that have been changed. This will restart all
of the services that have updated store paths (and anything that depends
on them), but will not affect services that haven't changed.

It doesn't delete old packages automatically: to do that run
`+min-collect-garbage+`, which will delete any packages not in the
current system closure. Note that Liminix does not have the NixOS
concept of environments or generations, and there is no way back from
this except for building the previous configuration again.

==== Caveats

* it needs there to be enough free space on the device for all the new
packages in addition to all the packages already on it - which may be a
problem if there is little flash storage or if a lot of things have
changed (e.g. a new version of nixpkgs).
* it may not be able to upgrade the kernel: this is device-dependent. If
your device boots from a kernel image on a raw MTD partition or or UBI
volume, update.sh is unable to alter the kernel partition. If your
device boots from a kernel inside the filesystem (e.g. using
bootloader.extlinux or bootloder.fit) then the kernel will be upgraded
along with the userland

==== Recovery/downgrades

The `+update.sh+` script also creates a timestamped symlink on the
device which points to the system configuration it installs. If you
install a configuration that doesn't work, you can revert to any other
installed configuration by

[arabic]
. booting to some kind of rescue or recovery system (which may be some
vendor-provided rescue option, or your own recovery system perhaps based
on `+examples/recovery.nix+`) and mounting your Liminix filesystem on
/mnt
. picking another previously-installed configuration that _link:[did]
work, and switching back to it:

[source,console]
----
# ls  -ld /mnt/*configuration
lrwxrwxrwx    1        90 /mnt/20252102T182104.configuration -> nix/store/v1w0h4zw65ah4c2r0k7nyy125qrxhq78-system-configuration-aarch64-unknown-linux-musl
lrwxrwxrwx    1        90 /mnt/20251802T181822.configuration -> nix/store/wqjl9s9xljl2wg8257292zghws9ssidk-system-configuration-aarch64-unknown-linux-musl
# : 20251802T181822 is the working system, so reinstall it
# /mnt/20251802T181822.configuration/bin/install /mnt
# umount /mnt
# reboot
----

This will install the previous configuration's activation binary into
/bin, and copy its kernel and initramfs into /boot. Note that it depends
on the previous system not having been garbage-collected.

==== Adding packages

If you simply wish to add a package without any change to services, you
can call `+min-copy-closure+` directly to install any package in Nixpkgs
or in the Liminix overlay

[source,console]
----
nix-build -I liminix-config=./my-configuration.nix \
 --arg device "import ./devices/mydevice" -A pkgs.tcpdump

nix-shell -p min-copy-closure root@the-device result/
----

Note that this only copies the package and its dependencies to the
device: it doesn't update any profile to add it to `+$PATH+`

[reftext="Levitate"]
[[levitate]]
=== Levitate: Reinstalling on a running system

Liminix is initially installed from a monolithic `+firmware.bin+` - and
unless you're running a writable filesystem, the only way to update it
is to build and install a whole new `+firmware.bin+`. However, you
probably would prefer not to have to remove it from its installation
site, unplug it from the network and stick serial cables in it all over
again.

It is not (generally) safe to install a new firmware onto the flash
partitions that the active system is running on. To address this we have
`+levitate+`, which a way for a running Liminix system to "soft restart"
into a ramdisk running only a limited set of services, so that the main
partitions can then be safely flashed.

==== Configuration

Levitate _needs to be configured when you create the initial system_ to
specify which services/packages/etc to run in maintenance mode. Most
likely you want to configure a network interface and an ssh for example
so that you can login to reflash it.

[source,nix]
----
defaultProfile.packages = with pkgs; [
  ...
  (levitate.override {
    config  = {
      services = {
        inherit (config.services) dhcpc sshd watchdog;
      };
      defaultProfile.packages = [ mtdutils ];
      users.root = config.users.root;
    };
  })
];
----

==== Use

Connect (with ssh, probably) to the running Liminix system that you wish
to upgrade.

[source,console]
----
bash$ ssh root@the-device
----

Run `+levitate+`. This takes a little while (perhaps a few tens of
seconds) to execute, and copies all config required for maintenance mode
to `+/run/maintenance+`.

[source,console]
----
# levitate
----

Reboot into maintenance mode. You will be logged out

[source,console]
----
# reboot
----

Connect to the device again - note that the ssh host key will have
changed.

[source,console]
----
# ssh -o UserKnownHostsFile=/dev/null root@the-device
----

Check we're in maintenance mode

[source,console]
----
# cat /etc/banner

LADIES AND GENTLEMEN WE ARE FLOATING IN SPACE

Most services are disabled. The system is operating
with a ram-based root filesystem, making it safe to
overwrite the flash devices in order to perform
upgrades and maintenance.

Don't forget to reboot when you have finished.
----

Perform the upgrade, using flashcp. This is an example, your device will
differ

[source,console]
----
# cat /proc/mtd
dev:    size   erasesize  name
mtd0: 00030000 00010000 "u-boot"
mtd1: 00010000 00010000 "u-boot-env"
mtd2: 00010000 00010000 "factory"
mtd3: 00f80000 00010000 "firmware"
mtd4: 00220000 00010000 "kernel"
mtd5: 00d60000 00010000 "rootfs"
mtd6: 00010000 00010000 "art"
# flashcp -v firmware.bin mtd:firmware
----

All done

[source,console]
----
# reboot
----