This repository was archived by the owner on May 6, 2020. It is now read-only.

rfc 3.0 architecture documentation#543

Merged
mcastelino merged 1 commit into clearcontainers:master from
egernst:3.0-architecture
Sep 15, 2017

@egernst egernst commented Sep 12, 2017

This is very much work in progress, and shouldn't be merged in this state. I'm pushing this as a PR so folks can easily get their eyes on the doc/pictures and start to make suggestions and edits. This is based on Graham's initial branch with some quick cleanup on my end. Note, the PNGs are exports from our (sorry, internal at the moment) Google doc. Once we have settled on images we can export the odp and put it on the repo as well...

Of note, I'd like to see someone write the section on the agent so it is of equivalent or more detail compared to what we had for hyperstart in 2.1. Not sure, but perhaps @amshinde would be well suited for this? Thoughts?

There are a few fixmes as well, which I figure will be easier to discuss during this RFC PR's review process.

Once we have input from folks, I'd recommend doing a giant squash and adding sign-offs from the contributors.

Fixes #392

time initializing devices of no use for containers.
- Skipping the guest BIOS/firmware and jumping straight to the Clear Containers kernel.

#### Agent

Author

@amshinde ; can you help with this section?

Contributor

@egernst I'll take a look.

@clearcontainersbot

Popular Images qa-passed 👍

multiplexes and demultiplexes those commands and streams for all container virtual machines.
There is only one `cc-proxy` instance running per Clear Containers host.

On the host, each container process is reaped by a Docker specific (`containerd-shim`) monitoring

Author

This is pretty docker specific. We may want to make this more generic so it can cover !docker as well.

Yes, this needs to be made more generic. At the moment, we always have a reaper between us and the higher layers of the stack. It's either containerd-shim (Docker) or conmon (CRI-O). So we could mention that and just basically replace containerd-shim with container process reaper.

Author

done

wants to run within an already running container (`docker exec`).

The container workload, i.e. the actual OCI bundle rootfs, is exported from the host to
the virtual machine via a 9pfs virtio mount point. Hyperstart uses this mount point as the root

Author

s/Hyperstart/agent

Author

ack

Although Clear Containers can run with any recent QEMU release, containers boot time and memory
footprint are significantly optimized by using a specific QEMU version called [`qemu-lite`](https://github.com/clearcontainers/qemu/tree/qemu-lite-v2.9.0).

`qemu-lite` improvements come through a new `pc-lite` machine type, mostly by:

Author

We are using PC machine type for 3.0, I think for release perhaps it makes sense to just remove all of the permutations and just describe what we actually have?

Yes. Describe what we use by default, and what we support (q35, pc-lite). We should also explain why we go with pc by default.

Author

took a pass at this @ https://github.com/clearcontainers/runtime/wiki/Clear-Containers-Architecture#hypervisor - not sure how much detail we should go into features we aren't actually using though....

`qemu-lite` improvements come through a new `pc-lite` machine type, mostly by:
- Removing many of the legacy hardware devices support so that the guest kernel does not waste
time initializing devices of no use for containers.
- Skipping the guest BIOS/firmware and jumping straight to the Clear Containers kernel.

Author

not really the case for 3.0 release, right?

With 2.7, we're not skipping firmware with the pc machine type while we are with pc-lite. Right now our packages install 2.7 and since we switched to pc, we're indeed no longer skipping the firmware.
With 2.9 we could skip the firmware with the pc machine type, we're tracking why we haven't switched to 2.9 yet: clearcontainers/packaging#28

virtio I/O serial one).
3. Run all the [OCI hooks](https://github.com/opencontainers/runtime-spec/blob/master/config.md#hooks) in the container namespaces,
as described by the OCI container configuration file.
4. **fixme** [Set up the container networking](https://github.com/clearcontainers/runtime/blob/master/documentation/architecture.md#networking).

Author

fixme is just for the link location, right @grahamwhaley ?

Contributor

Looks like it. I believe we can change it to a 'local reference', so just #networking should resolve to the section in this document.

Author

done

egernst commented Sep 12, 2017

@sameo @jodh-intel @mcastelino - if you have a chance to start taking a look. @jcvenegas perhaps we can reference the proxy protocol in the proxy section?

@iphutch -- early heads up and FYI that this is in (an early stage of) the pipeline

This is an architectural overview of Clear Containers, based on the 3.0 release.

The [Clear Containers runtime (cc-runtime)](https://github.com/clearcontainers/runtime)
complies with the [OCI](https://github.com/opencontainers) specifications and thus

I'd say "compatible" rather than "complies with".

Contributor

agree - 'compatible'

Author

done


The container process is then spawned by an [agent](https://github.com/clearcontainers/agent),
running as a daemon on the guest operating system.
Hyperstart opens 2 virtio serial interfaces (Control and I/O) on the guest, and QEMU exposes them

  • This should say "The agent", not Hyperstart now.
  • Small numbers should be spelt out in full, so can you change this to "two".
  • I'd say "in the guest" rather than "on the guest" to reinforce that we are talking about what is happening inside the VM.

Author

ack

`cc-runtime` creates a QEMU/KVM virtual machine for each container the Docker engine creates.

The container process is then spawned by an [agent](https://github.com/clearcontainers/agent),
running as a daemon on the guest operating system.

I'd add "inside the virtual machine" at the end of this sentence to be clearer here.

Author

done

`stderr`, `stdin`) between the guest and the Docker Engine.

For any given container, both the init process and all potentially executed commands within that
container, together with their related I/O streams, need to go through 2 virtio serial interfaces

s/2/two/

Author

done


#### Hypervisor

Clear Containers use [KVM](http://www.linux-kvm.org/page/Main_Page)/[QEMU](http://www.qemu-project.org/) to

Above it was QEMU/KVM so I'd stick with that ordering for consistency.

Author

done

For more details about `cc-proxy`'s protocol, theory of operations or debugging tips, please read
[`cc-proxy` README](https://github.com/clearcontainers/proxy).

#### Shim

@amshinde - can you review this section please?

Clear Containers utilises the Linux kernel DAX (Direct Access filesystem)
feature to efficiently map some host side files into the guest VM space.
In particular, Clear Containers uses the `QEMU` nvdimm feature to provide a
memory mapped virtual device that can be used to DAX map the mini-OS root

This seems to be the first mention of the mini-OS. I think it might be helpful to introduce it when we first start talking about the VM, pointing out that there is a mini-OS and the 9p-mapped docker image (ubuntu, busybox, etc).

file and device mapping mechanisms:

- Mapping as a direct access devices allows the guest to directly access
the memory pages (such as via eXecute In Place (XIP)), bypassing the guest

s/the memory pages/the host memory pages/ ?

Author

done.

host to be demand loaded using page faults, rather than having to make requests
via a virtualised device (causing expensive VM exits/hypercalls), thus providing
a speed optimisation.
- Utilising shmem MAP_SHARED on the host allows the host to efficiently

s/shmem MAP_SHARED/MAP_SHARED shared memory/ ?

Author

done.

Information on the use of nvdimm via QEMU is available in the QEMU source code
[here](http://git.qemu-project.org/?p=qemu.git;a=blob;f=docs/nvdimm.txt;hb=HEAD)

### Previous releases

Maybe this should be called "Architectural changes by release" or something? We can then document here what changed between 2.1 and 3.0.

![QEMU/KVM](qemu.png)

** fixme - discuss different QEMUs - lite, q35, pc etc. **

Contributor

I think for this release we can drop this 'fixme' and just mention qemu-lite.

We can also mention that we support several QEMU machine types.

Author

ack

wants to run within an already running container (`docker exec`).

The container workload, i.e. the actual OCI bundle rootfs, is exported from the host to
the virtual machine via a 9pfs virtio mount point. Hyperstart uses this mount point as the root

Here we should mention that we go for virtio-blk if we find a block based graph driver.

Author

done
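
For readers following this thread, here is a minimal sketch of the selection suggested above: use virtio-blk when the graph driver is block based, otherwise fall back to 9pfs. The function name and the driver set are hypothetical and only illustrate the idea; this is not the code in `cc-runtime`.

```python
# Hypothetical illustration only; not taken from cc-runtime.
# Assumption: "devicemapper" stands in for any block based graph driver.
BLOCK_BASED_GRAPH_DRIVERS = {"devicemapper"}


def rootfs_transport(graph_driver: str) -> str:
    """Return how the container rootfs would be handed to the virtual machine."""
    if graph_driver in BLOCK_BASED_GRAPH_DRIVERS:
        # A block device can be passed to the guest directly.
        return "virtio-blk"
    # Filesystem based drivers are shared with the guest over 9pfs.
    return "9pfs"
```

Calling `rootfs_transport("overlay2")` takes the 9pfs branch, since that driver is not in the assumed block based set.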

by a set of namespaces (UTS, PID, mount and IPC). Although a pod can hold several containers,
`cc-runtime` always runs a single container per pod. **fixme** incorrect<--

**todo** add details on the agent protocol

Also, explain that the agent is based on libcontainer, the runc library, and reference it.

1. `cc-runtime` connects to `cc-proxy` and sends it the `attach` command to let it know which pod
we want to use to run the `exec` command.
2. `cc-runtime` sends the allocateIO command to the proxy, for getting the `agent` I/O sequence
numbers for the `exec` command I/O streams.

This has also changed with 3.0. We connect to the proxy and get a token back. We then create a shim with the given token and the shim connects to the proxy.
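
A rough in-memory model of the flow described in this comment may help while rewording the section: the runtime asks the proxy for a token, hands it to the shim, and the shim presents it when it connects. The class and method names below are hypothetical and do not mirror the real `cc-proxy` API.

```python
import secrets


class Proxy:
    """Toy stand-in for cc-proxy; names are assumptions, not the real API."""

    def __init__(self):
        self._tokens = {}  # token -> container ID

    def register_container(self, container_id: str) -> str:
        # Runtime side: ask the proxy for a token tied to one container process.
        token = secrets.token_hex(16)
        self._tokens[token] = container_id
        return token

    def connect_shim(self, token: str) -> str:
        # Shim side: present the token to be associated with that process.
        container_id = self._tokens.get(token)
        if container_id is None:
            raise PermissionError("unknown token")
        return container_id


proxy = Proxy()
token = proxy.register_container("container-1")  # done by the runtime
shim_target = proxy.connect_shim(token)          # done by the shim it spawned
```

The point of the sketch is only that the token is the sole link between the shim and the container process it represents.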

agent instance running in the appropriate guest.
3. After deleting the last running pod, the `agent` will gracefully shut the virtual machine
down.
4. `cc-runtime` sends the `BYE` command to `cc-proxy`, to let it know that a given virtual

BYE has been replaced with UnregisterVM


- A UNIX, named socket for all `cc-runtime` instances on the host to send commands to `cc-proxy`.
- One socket pair per `cc-shim` instance, to send stdin and receive stdout and stderr I/O streams. See the
[cc-shim section](#shim)

That has changed as well and is now simplified. There is one single socket (UNIX or TCP) for all runtimes and all shims.

Author

@jcvenegas -- can you take a look and update?

Contributor

@jcvenegas jcvenegas Sep 12, 2017

@egernst I just checked: it is updated and describes the interaction between proxy and shim in the new protocol. Probably @amshinde updated it :D

the `AllocateIO` command to `cc-proxy` to have it request the `agent` to allocate those sequence numbers.
They will be passed as command line arguments to `cc-shim`, who will then use them to e.g. prepend its stdin
stream packets with the right sequence number.
- `Hyper`: This command is used by both `cc-runtime` and `cc-shim` to forward `agent` specific

This is well out of date as well for 3.0.

- `Hyper`: This command is used by both `cc-runtime` and `cc-shim` to forward `agent` specific
commands.

For more details about `cc-proxy`'s protocol, theory of operations or debugging tips, please read

We should definitely have a link to it.

The [Clear Containers runtime (cc-runtime)](https://github.com/clearcontainers/runtime)
complies with the [OCI](https://github.com/opencontainers) specifications and thus
works seamlessly with the [Docker Engine](https://www.docker.com/products/docker-engine)
pluggable runtime architecture. In other words, one can transparently replace the

Contributor

I would not say replace. We can add another runtime. We cannot satisfy all possible types of containers like priv ones.

running as a daemon on the guest operating system.
Hyperstart opens 2 virtio serial interfaces (Control and I/O) on the guest, and QEMU exposes them
as serial devices on the host. `cc-runtime` uses the control device for sending container
management commands to the agent while the I/O serial device is used to pass I/O streams (`stdout`,

Contributor

Would be good to point to a section that contains the container management commands, or maybe the source file where the command protocol is defined.

Contributor

Added a URL to the godoc for the proxy API protocol, and added a link to the UML sequence diagram.

management commands to the agent while the I/O serial device is used to pass I/O streams (`stdout`,
`stderr`, `stdin`) between the guest and the Docker Engine.

For any given container, both the init process and all potentially executed commands within that

Contributor

Would be good to document how exec is handled too.


The `agent` execution unit is the pod. An `agent` pod is a container sandbox defined
by a set of namespaces (UTS, PID, mount and IPC). Although a pod can hold several containers,
`cc-runtime` always runs a single container per pod. **fixme** incorrect<--

Contributor

This is not accurate

cc-runtime always runs a single container per pod.


Here we will describe how `cc-runtime` handles the most important OCI commands.

##### `create`

Contributor

Would be nice to link to the source code file that implements create

Author

done

egernst commented Sep 12, 2017

Moved this document to https://github.com/clearcontainers/runtime/wiki/Clear-Containers-Architecture to facilitate better collaboration at this point in the review process.

@egernst egernst requested a review from iphutch September 14, 2017 15:35

egernst commented Sep 14, 2017

@iphutch - Can you start taking a look at this PR?

@iphutch iphutch left a comment

Line length and indentation are the biggies here. It's a good looking doc. See suggestions for other changes too.

is compatible with the [OCI](https://github.com/opencontainers) [runtime specification](https://github.com/opencontainers/runtime-spec)
and thus works seamlessly with the
[Docker Engine](https://www.docker.com/products/docker-engine) pluggable runtime
architecture. It also supports the [Kubernetes Container Runtime Interface (CRI)](https://github.com/kubernetes/kubernetes/tree/master/pkg/kubelet/apis/cri/v1alpha1/runtime) through the [CRI-O](https://github.com/kubernetes-incubator/cri-o) implementation. In other words, one can transparently select between the

  • Line length
  • Docker, Kubernetes, and CRI-O need to be followed by an * in their first instance in the doc.
  • Let's avoid "one". Saying "you" is acceptable when referring to the reader/user:
    In other words, you can transparently...

![Runtime and virtcontainers](arch-images/runtime-vc-relationship.png)
![Docker and Clear Containers](arch-images/docker-cc.png)

`cc-runtime` creates a QEMU/KVM virtual machine for each container the Docker

QEMU*


The [Clear Containers runtime (cc-runtime)](https://github.com/clearcontainers/runtime)
is compatible with the [OCI](https://github.com/opencontainers) [runtime specification](https://github.com/opencontainers/runtime-spec)
and thus works seamlessly with the

replace/thus/therefore

to the container process on the guest and pass the container `stdout` and `stderr`
streams back up the stack to CRI-O or Docker via the container process reaper.
`cc-runtime` creates a `cc-shim` daemon for each container and for each OCI command
received to run within an already running container (i.e. `docker exec`).

Line length on 66 and 68

`cc-runtime` creates a `cc-shim` daemon for each container and for each OCI command
received to run within an already running container (i.e. `docker exec`).

The container workload, i.e. the actual OCI bundle rootfs, is exported from the

Avoid latin abbreviations:
The container workload, that is, the actual OCI bundle rootfs, is...

2. Get CNI plugin information

3. Start the plugin (providing previously created netns) to add a network
described into /etc/cni/net.d/ directory. At that time, the CNI plugin will

use literal text for the directory name

5. Start VM inside the netns and start the container

## Storage
Container workloads are shared with the virtualized environment through 9pfs.

can we link to information on 9pfs?


## DAX

Clear Containers utilises the Linux kernel DAX (Direct Access filesystem)

Ditto, can we link to DAX info here?:
Direct Access filesystem

share pages.

Clear Containers uses the following steps to set up the DAX mappings:
- QEMU is configured with an nvdimm memory device, with a memory file

Use numbers instead of - here. Also, indentation

More information about DAX can be found in the Linux Kernel
[documentation](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/dax.txt)

Information on the use of nvdimm via QEMU is available in the QEMU source code

make "QEMU source code" the link and remove "here"

egernst commented Sep 14, 2017

@iphutch -- Thanks! Will start to address these now...

- Connects to `cc-proxy` using a token obtained by calling the `cc-proxy` `ConnectShim` command. The token is passed from `cc-runtime` to `cc-shim` when the former spawns the latter and is used to identify the true container process that the shim process will be shadowing or representing.
- Fragments and encapsulates the standard input stream from the container process reaper into `cc-proxy` stream frames:
```
1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
```

Author

@amshinde -- what's the intention for the above line?

Contributor

Good question, I had asked Damien the same with no answer on that.
This is a direct copy of the frame format from his documented proxy protocol :)


- Moved from hyperstart to `cc-agent` as an agent inside the VM.
- Moved from `qemu-lite` to `pc` QEMU machine type.
- Rewrite of runtime in go, leveraging virtcontainers.

I'd add a few items here:

  • virtio-blk for block based graph drivers
  • New simplified protocol between the shim, proxy and runtime
  • KSM throttling

Author

Agreed, though if we bring up KSM throttling, we should probably describe its setup in the proxy section, right?

Author

done.

@iphutch iphutch left a comment

You're a champ for getting the bulk of these indentation issues, a few more to go!

3. Call the prestart hook (from inside the netns)

4. Scan network interfaces inside netns and get the name of the interface
created by prestart hook

Indentation :( From here down: lines 525, 538, 543, 592, 595, 599, 651, and finally (fittingly) 666.

@iphutch iphutch left a comment

Three things to fix and I'm happy to approve :)

[default Docker and CRI-O runtime (runc)](https://github.com/opencontainers/runc)
and `cc-runtime`.

![Runtime and virtcontainers](arch-images/runtime-vc-relationship.png)

This .png isn't included in the PR or has a different name. It will bring a 404 error.

Author

Ouch - git blame says that was from me, but I don't recall adding this, nor can I find this image anywhere... removing now...

the [Clear Containers QEMU repo](https://github.com/clearcontainers/qemu/tree/qemu-lite-v2.9.0).
This transition has been delayed until after the release of Clear Containers 3.0
due to regressions, as described in [runtime issue 407]
(https://github.com/clearcontainers/runtime/issues/407). Once support for

Breaking up this runtime issue 407 breaks the link's markup. Don't worry about going over 78 chars when it's code or markup forcing you over:

due to regressions, as described in [runtime issue 407](https://github.com/clearcontainers/runtime/issues/407).
Once support for

Author

I meant to fix all those -- thanks! Done

## DAX

Clear Containers utilises the Linux kernel DAX [(Direct Access filesystem)]
(https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/dax.txt)

Build error. Ignore line length rules for markup OR bring the whole link onto line 575.

Author

Whoops - thanks

@iphutch iphutch left a comment

Thanks for all the changes! Thumbs up from me.

## Agent

[`cc-agent`](https://github.com/clearcontainers/agent) is a daemon running in the
guest as a supervisor for managing containers and processes potentially running

The word "potentially" is redundant so I'd drop it.

Contributor

agreed - was just about to say the same thing.

Author

done

@jodh-intel

You might like to add in some Contributions-by: in the commit like we have here: 56e96d3.

Contributor

@grahamwhaley grahamwhaley left a comment

Nothing serious, mostly stylistic etc.


![Docker and Clear Containers](arch-images/docker-cc.png)

`cc-runtime` creates a QEMU\*/KVM virtual machine for each container the Docker
Contributor

Would it be more accurate to say 'for each container or pod', rather than just 'container'? For k8s I believe we have one VM per pod.

Author

Sure, agreed.

- A control serial channel over which the `cc-agent` sends and receives specific
commands for controlling and managing pods and containers. Detailed information
about the commands can be found at [`cc-agent` API](https://github.com/clearcontainers/agent/tree/master/api).
- An I/O serial channel for passing the container processes output streams (stdout,
Contributor

earlier in the doc we backtick std[in|out|err] to `stdin` - we should consider doing that consistently through the document.

Author

done


1. Create the networking container namespace on the host, according to the container
OCI configuration file. We only support networking namespaces for now, but
will support more of them later.
Contributor

s/of them/namespaces/

Contributor

or even s/more of them/other namespaces/

Author

nice, done

2. Run all the [prestart OCI hooks](https://github.com/opencontainers/runtime-spec/blob/master/config.md#hooks)
in the host namespaces created in step 1, as described by the OCI container
configuration file.
3. [Set up the container networking namespace up](#networking). This is when
Contributor

'Set up the container networking namespace up': Is that 'up' at the end a typo/redundant?

Author

agreed. done
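
For anyone following the numbered steps quoted in this thread: step 2 runs the prestart hooks listed in the container's OCI configuration file. Below is a minimal Go sketch of reading those entries, assuming the `hooks.prestart` layout from the OCI runtime specification. The sample JSON, the hook paths, and the `prestartHookPaths` helper are hypothetical and are not taken from `cc-runtime`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hook mirrors the fields of one OCI hook entry that this sketch needs.
type hook struct {
	Path string   `json:"path"`
	Args []string `json:"args,omitempty"`
}

// ociConfig keeps only the hooks section of an OCI config.json.
type ociConfig struct {
	Hooks struct {
		Prestart []hook `json:"prestart"`
	} `json:"hooks"`
}

// prestartHookPaths returns the executable path of each prestart hook,
// in the order they are listed in the configuration.
func prestartHookPaths(configJSON []byte) ([]string, error) {
	var cfg ociConfig
	if err := json.Unmarshal(configJSON, &cfg); err != nil {
		return nil, err
	}
	paths := make([]string, 0, len(cfg.Hooks.Prestart))
	for _, h := range cfg.Hooks.Prestart {
		paths = append(paths, h.Path)
	}
	return paths, nil
}

func main() {
	// Hypothetical configuration fragment, for illustration only.
	sample := []byte(`{"hooks": {"prestart": [
		{"path": "/usr/bin/setup-network", "args": ["setup-network", "--verbose"]},
		{"path": "/usr/bin/fix-mounts"}
	]}}`)
	paths, err := prestartHookPaths(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for i, p := range paths {
		fmt.Printf("prestart hook %d: %s\n", i+1, p)
	}
}
```

The sketch only lists the hooks; actually executing them inside the namespace created in step 1 is outside its scope.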

`cc-proxy` and let it know that they stop monitoring their container process.

For more details about `cc-proxy`'s protocol, theory of operations or debugging
tips, please read
Contributor

s/read/read the/

Author

ack


## Shim

A container process reaper, such as Docker's `containerd-shim` or crio's `conmon`,
Contributor

s/crio's/CRI-O's/

Author

ack


`cc-shim` has an implicit knowledge about which VM agent will handle those streams
and signals and thus acts as an encapsulation layer between the container process
reaper and the `cc-agent`. `cc-shim`:
Contributor

Maybe put cc-shim: on a new paragraph, and expand a little to something like 'cc-shim performs the following steps:' or similar.

- Fragments and encapsulates the standard input stream from the container process
reaper into `cc-proxy` stream frames:
```
1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
```
Contributor

As I think @egernst and @amshinde have noted - the '1 1 1 1 ' line seems redundant and makes no sense. Yes, we queried this on the original doc, and I have a feeling the answer was 'oops, a copy/paste issue' :-)

Contributor

I should note - if anything then I think it would have been denoting byte lanes - but it is failing at that on many fronts right now. It might be nice to show it as byte lanes though (1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3).

Author

Now that it is adjusted, it shows the bit number. The 1's aren't redundant: they are the most significant decimal digit of each bit number. Turn your head clockwise when you look at it :)
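
To make the ruler convention discussed in this thread concrete, here is a small Go sketch that prints a two-row bit ruler: the tens digit of each bit number on top and the units digit below, so each column read top to bottom spells one bit number. The `bitRuler` helper and the 32-bit width are assumptions for illustration; the sketch says nothing about the actual `cc-proxy` frame fields.

```go
package main

import (
	"fmt"
	"strings"
)

// bitRuler returns two header rows for a frame diagram covering bits
// 0..n-1: the tens digit of each bit number (blank below 10) and the
// units digit. Read top to bottom, each column spells one bit number.
func bitRuler(n int) (tens, units string) {
	t := make([]string, n)
	u := make([]string, n)
	for i := 0; i < n; i++ {
		if i < 10 {
			t[i] = " "
		} else {
			t[i] = fmt.Sprint(i / 10)
		}
		u[i] = fmt.Sprint(i % 10)
	}
	return strings.Join(t, " "), strings.Join(u, " ")
}

func main() {
	tens, units := bitRuler(32) // 32 bits is an assumption for illustration
	fmt.Println(tens)
	fmt.Println(units)
}
```

For 32 bits the top row holds ten 1s, ten 2s and two 3s after the leading blanks, which is the same sequence as the row quoted in the excerpt.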

Users can check to see if the container uses a devicemapper block device as its
rootfs by calling `mount(8)` within the container. If a devicemapper block device
is used, the root filesystem ('/') will be mounted from `/dev/vda`.

Contributor

We could note that devicemapper block device mode can be disabled in the runtime config file if necessary.

Author

sure
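
The rootfs check quoted at the top of this thread can also be scripted. The following is a rough Go sketch, assuming the standard Linux `/proc/mounts` layout (device, mount point, filesystem type, options); the `rootDevice` helper is hypothetical and only mirrors what a user would read off `mount(8)` output inside the container.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// rootDevice scans mount-table text in /proc/mounts format
// ("device mountpoint fstype options ...") and returns the device of the
// last entry mounted on "/", or "" if there is none. Later entries
// shadow earlier ones, so the last match is the active root.
func rootDevice(mounts string) string {
	dev := ""
	for _, line := range strings.Split(mounts, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[1] == "/" {
			dev = fields[0]
		}
	}
	return dev
}

func main() {
	data, err := os.ReadFile("/proc/mounts")
	if err != nil {
		fmt.Println("cannot read mount table:", err)
		return
	}
	dev := rootDevice(string(data))
	fmt.Printf("rootfs device: %s (block-device rootfs: %v)\n", dev, dev == "/dev/vda")
}
```

Run inside the container, a result of `/dev/vda` would indicate the block-device rootfs described in the excerpt; any other device suggests a different rootfs path.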

@grahamwhaley
Contributor

This is looking really good - a visible improvement on the already rather good CC2.x document - so, kudos to everybody involved in the update.

@clearcontainers clearcontainers deleted a comment from iphutch Sep 15, 2017
@clearcontainers clearcontainers deleted a comment from iphutch Sep 15, 2017
This is an architectural overview of Clear Containers, based on the 3.0 release.

The [Clear Containers runtime (cc-runtime)](https://github.com/clearcontainers/runtime)
is compatible with the [OCI](https://github.com/opencontainers) [runtime specification](https://github.com/opencontainers/runtime-spec)

Hi @iphutch - does OCI need an asterisk here?


Yes, we will need an asterisk here. Good catch.


In the future, Clear Containers plans to move to a 2.9-based version of QEMU,
available at
the [Clear Containers QEMU repo](https://github.com/clearcontainers/qemu/tree/qemu-lite-v2.9.0).

s/repo/repository/.

within those containers.

The `cc-agent` execution unit is the pod. A `cc-agent` pod is a container sandbox
defined by a set of namespaces (NS, UTS, IPC and PID). `cc-runtime` can run several

s/NS/mount, net, cgroup/ I think @sameo?

the [Clear Containers QEMU repo](https://github.com/clearcontainers/qemu/tree/qemu-lite-v2.9.0).
This transition has been delayed until after the release of Clear Containers 3.0
due to regressions, as described in [runtime issue 407](https://github.com/clearcontainers/runtime/issues/407).
Once support for features like hotplug is available in `Q35`, the project will

s/Q35/q35/ for consistency with all other mentions of machine types.

Most users will not need to modify the configuration file.

The file is well commented and provides a few "knobs" that can modify the
behaviour of the runtime.

I wrote this but on re-reading, I think it would be better to say, "that can be used to modify the behaviour of the runtime."

5. Start VM inside the netns and start the container

## Storage
Container workloads are shared with the virtualized environment through [9pfs](https://www.kernel.org/doc/Documentation/filesystems/9p.txt).

s/through 9pfs/using the 9pfs filesystem/

Clear Containers utilises the Linux kernel DAX [(Direct Access filesystem)](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/dax.txt)
feature to efficiently map some host side files into the guest VM space.
In particular, Clear Containers uses the `QEMU` nvdimm feature to provide a
memory mapped virtual device that can be used to DAX map the virtual machine's

hyphenate "memory-mapped"?


I generally try to avoid hyphenation but I'm on board with memory-mapped and host-side.

feature to efficiently map some host side files into the guest VM space.
In particular, Clear Containers uses the `QEMU` nvdimm feature to provide a
memory mapped virtual device that can be used to DAX map the virtual machine's
root filesystem into the guest space.

s/guest space/guest memory address space/

host to be demand loaded using page faults, rather than having to make requests
via a virtualised device (causing expensive VM exits/hypercalls), thus providing
a speed optimisation.
- Utilising MAP_SHARED shared memory on the host allows the host to efficiently

s/MAP_SHARED/`MAP_SHARED`/


Clear Containers utilises the Linux kernel DAX [(Direct Access filesystem)](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/dax.txt)
feature to efficiently map some host side files into the guest VM space.
In particular, Clear Containers uses the `QEMU` nvdimm feature to provide a

Remove backticks around QEMU.

@grahamwhaley
Contributor

I think we ( @iphutch ) may have mentioned the high-level-overview.png is not referenced anywhere - to note, it was referenced from the proxy README.md in the cc2.x repo, and thus I believe has been unnecessarily carried over (and there is a copy over in the clearcontainers/proxy/docs dir I think). Yes, it can be dropped from here.


The `delete` code path differs significantly between having to delete one container
inside a pod (as is typical in Docker) and having to delete an entire pod (which
is unique to Kubernetes). In the former case, `cc-runtime` will only send a
Contributor

rather than 'is unique to' I might say 'such as from'.

The `delete` code path differs significantly between having to delete one container
inside a pod (as is typical in Docker) and having to delete an entire pod (which
is unique to Kubernetes). In the former case, `cc-runtime` will only send a
`SIGKILL` signal to the container process. In the latter case, the whole thing
Contributor

s/whole thing/whole pod/ or something similar - 'all components' or something

further work to do:
 -upload text used to create UMLs
 -add UML for other key OCI commands
 -add crio/conmon/k8s diagram for parity w/ docker
 -Describe KSM throttling feature in the proxy or its own section
  once the feature lands.

Contributions-by: Graham Whaley <graham.whaley@intel.com>
Contributions-by: James O. D. Hunt <james.o.hunt@intel.com>
Contributions-by: Samuel Ortiz <sameo@linux.intel.com>
Contributions-by: Archana Shinde <archana.m.shinde@intel.com>
Contributions-by: Jose Carlos Venegas Munoz <jose.carlos.venegas.munoz@intel.com>
Signed-off-by: Eric Ernst <eric.ernst@intel.com>
@mcastelino mcastelino merged commit c9ccbb7 into clearcontainers:master Sep 15, 2017
@ericwadams

I recommend adding a Kubernetes overall architecture guide including the shim/proxy/agent and how it works with crio.

mcastelino pushed a commit to mcastelino/runtime that referenced this pull request Dec 6, 2018
Use filepath.Join() instead of simple string concatenation
to form the file path; otherwise the "/" between
the two parts is lost.

Fixes clearcontainers#543.

Signed-off-by: Fupan Li <lifupan@gmail.com>
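
The problem described in the commit message above is easy to reproduce. Here is a minimal Go sketch contrasting plain string concatenation with `filepath.Join` from the standard library; the directory and file names and the two helper functions are hypothetical and are not taken from the referenced commit.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// joinByAppend shows the fragile approach: plain concatenation drops the
// separator unless the caller remembers to add it.
func joinByAppend(dir, name string) string {
	return dir + name
}

// joinPath uses filepath.Join, which inserts exactly one separator and
// cleans the result.
func joinPath(dir, name string) string {
	return filepath.Join(dir, name)
}

func main() {
	dir, name := "/run/example", "config.json" // hypothetical values
	fmt.Println(joinByAppend(dir, name))
	fmt.Println(joinPath(dir, name))
}
```

`filepath.Join` also copes with a trailing separator on the directory part, so callers do not need to normalise their inputs first.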
Development

Successfully merging this pull request may close these issues.

Create an architecture document