TRITON-9: affinity is racy

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-04-05T15:44:33.136Z)

Fix Versions

2019-04-11 Pete Hornberger (Release Date: 2019-04-11)

Description

It seems that if you use docker-compose scale container=number_of_instances, compose fires off parallel docker requests and reports each one as it completes. In the run below you can see the 5th container report done before the 4th.
ravio:sdc-docker112 $ docker-compose scale consul=3
Creating and starting sdcdocker112_consul_1 ... done
Creating and starting sdcdocker112_consul_2 ... done
Creating and starting sdcdocker112_consul_3 ... done
ravio:sdc-docker112 $ docker-compose scale consul=5
Creating and starting sdcdocker112_consul_4 ...
Creating and starting sdcdocker112_consul_5 ... done

When you have a compose file that looks something like the following, the affinity rules have a chance of being missed. In that scenario the containers end up on the same CN instead of being spread out as the user wants.
consul:
    image: autopilotpattern/consul:latest
    restart: always
    mem_limit: 128m
    ports:
      - 8500
    dns:
      - 127.0.0.1
    labels:
      - triton.cns.services=jenkins-consul
      - com.example.type=jenkins-consul
      - com.docker.swarm.affinities=["com.example.type!=jenkins-consul"]
    command: >
      /usr/local/bin/containerpilot
      /bin/consul agent -server
        -bootstrap-expect 3
        -config-dir=/etc/consul
        -ui-dir /ui

In /lib/backends/sdc/containers.js, createContainer calls buildVmPayload, which in turn calls localityFromContainer (affinity.js). Since compose fires off multiple docker requests to sdc-docker in parallel, we likely have a race in the vmapi calls that look up the relevant parameters.
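
A minimal sketch of the suspected check-then-act race; the helper names listVmsByTag and provisionVm are invented for illustration and are not the actual sdc-docker API:

// Illustrative only: listVmsByTag() and provisionVm() are invented
// names, not sdc-docker's real interface.
async function handleCreateContainer(rule) {
    // 1. Resolve the affinity tag to the VMs that exist right now.
    const matches = await listVmsByTag(rule.key, rule.value);

    // 2. Build a locality hint from only those UUIDs and provision.
    const locality = { strict: true, far: matches.map((vm) => vm.uuid) };

    // Two concurrent requests can both finish step 1 before either VM
    // is created, so both hints come out empty and DAPI is free to
    // co-locate the containers.
    return provisionVm({ locality: locality });
}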

There's been a report of this at https://github.com/joyent/sdc-docker/issues/112

Comments

Comment by Michael Zeller
Created at 2017-04-20T03:33:55.000Z
I am going to try to reproduce this.

Comment by Michael Zeller
Created at 2017-04-20T03:50:31.000Z
Updated at 2017-04-20T03:53:07.000Z
In east-3b the last two containers ended up on the same CN.

ravio:sdc-docker112 $ docker-compose down; docker-compose scale consul=5; triton insts -H -o name,compute_node | sort | uniq -c | sort | grep _consul
Creating and starting sdcdocker112_consul_1 ... done
Creating and starting sdcdocker112_consul_2 ... done
Creating and starting sdcdocker112_consul_3 ... done
Creating and starting sdcdocker112_consul_4 ... done
Creating and starting sdcdocker112_consul_5 ... done
   1 sdcdocker112_consul_1     44454c4c-4400-1054-8052-b5c04f383432
   1 sdcdocker112_consul_2     44454c4c-5400-1034-804d-b5c04f383432
   1 sdcdocker112_consul_3     44454c4c-4400-1046-8050-b5c04f383432
   1 sdcdocker112_consul_4     44454c4c-4400-1043-8053-b5c04f383432
   1 sdcdocker112_consul_5     44454c4c-4400-1043-8053-b5c04f383432

Another run had similar results:
ravio:sdc-docker112 $ docker-compose down; docker-compose scale consul=5; triton insts -H -o name,compute_node | sort | uniq -c | sort | grep _consul
Stopping sdcdocker112_consul_1 ... done
Stopping sdcdocker112_consul_2 ... done
Stopping sdcdocker112_consul_5 ... done
Stopping sdcdocker112_consul_4 ... done
Stopping sdcdocker112_consul_3 ... done
Removing sdcdocker112_consul_1 ... done
Removing sdcdocker112_consul_2 ... done
Removing sdcdocker112_consul_5 ... done
Removing sdcdocker112_consul_4 ... done
Removing sdcdocker112_consul_3 ... done
Creating and starting sdcdocker112_consul_1 ... done
Creating and starting sdcdocker112_consul_2 ... done
Creating and starting sdcdocker112_consul_3 ... done
Creating and starting sdcdocker112_consul_4 ... done
Creating and starting sdcdocker112_consul_5 ... done
   1 sdcdocker112_consul_1     44454c4c-5400-1034-804d-b5c04f383432
   1 sdcdocker112_consul_2     44454c4c-4400-1043-8053-b5c04f383432
   1 sdcdocker112_consul_3     44454c4c-4400-1043-8053-b5c04f383432
   1 sdcdocker112_consul_4     44454c4c-5400-1034-804d-b5c04f383432
   1 sdcdocker112_consul_5     44454c4c-5400-1034-804d-b5c04f383432

Comment by Josh Wilsdon
Created at 2017-04-20T06:59:28.000Z
Updated at 2018-05-07T17:51:35.437Z
My understanding is that the affinity rules are best-effort. Especially in east-3b this is really not surprising to me.

Would you have expected some of these provisions to have failed instead? We might be able to turn this into a hard filter such that two of these provisions would have failed instead of being co-located.

UPDATE: looking further at https://github.com/joyent/sdc-docker/blob/master/docs/api/features/placement.md it does seem as though those other two probably should have failed. I'll try to look into why they didn't. A race does indeed seem like it'd be possible here.

Comment by Michael Zeller
Created at 2017-04-20T14:27:45.000Z
Updated at 2017-04-20T14:29:13.000Z
"My understanding is that the affinity rules are best-effort."
That doesn't seem right. We support both hard and soft affinity rules. Hard meaning that if the condition is not met we fail the provision, while soft is best-effort.
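
For reference, the syntactic difference is a trailing "~" on the operator. This is my reading of the placement docs, and the two label lines below are alternatives, not meant to be combined:

labels:
  # Hard anti-affinity: fail the provision if the rule can't be honoured.
  - com.docker.swarm.affinities=["com.example.type!=jenkins-consul"]
  # Soft anti-affinity (note the "~"): best-effort only.
  - com.docker.swarm.affinities=["com.example.type!=~jenkins-consul"]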

The user who reported the issue on GitHub has a total of 4 CNs, and all of their provisions land on the same CN. However, if they provision the containers one at a time instead, the containers end up spread across the CNs as expected.

Comment by Trent Mick
Created at 2017-04-20T20:53:25.000Z
Updated at 2017-04-20T22:32:25.000Z
They aren't best-effort. That's only for the "soft" affinities using "~" in the operator.

@Smithx10 on the GH ticket said:

> After reading through ./lib/backends/sdc/affinity.js, I think we are only checking what is currently running and handing it off to DAPI.

They are correct. The translation to VM UUIDs (which are all that current DAPI locality rules support) from (a) tags (which swarm affinity rule syntax supports, e.g. "role=web") and (b) VM UUID patterns is done in the sdc-docker engine, before we get to the processing in the provision workflow that serializes server selection (https://github.com/joyent/sdc-vmapi/blob/e4cfd962384a45038ad823ab197ca9b375d7d94b/lib/workflows/provision.js#L523-L541).
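
Concretely, by the time the payload leaves the engine, a tag rule has already been flattened to whichever UUIDs existed at translation time; roughly this (placeholder values, and my understanding of the locality-hint shape):

"locality": {
    "strict": true,
    "far": ["<uuid of consul_1>", "<uuid of consul_2>"]
}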

Affinity via CloudAPI and the `triton` CLI aren't affected because they don't yet support affinity via tag, nor affinity via an instance name glob.

fixing this

The only good fix I see here is to update DAPI's locality hints processing to support the tag syntax and the VM name pattern syntax (e.g. "container==foo*", "container==/^web/"). We could then also consider having DAPI support the full affinity syntax, i.e. also supporting "container=(short docker id)".


That might be interesting because of DAPI's recent batching of candidate servers. We'd perhaps want DAPI to first hit VMAPI to gather relevant information for tag affinities... that might reduce the server consideration set.
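
A rough sketch of the name-pattern matching DAPI would need to grow, with a hypothetical helper (this is not sdc-designation code):

// Hypothetical: match a VM alias against the three pattern forms the
// affinity syntax allows: regex ("/^web/"), glob ("foo*"), and exact.
// Simplified: the glob branch doesn't escape other regex metacharacters.
function aliasMatches(pattern, alias) {
    if (pattern.length > 2 && pattern[0] === '/' &&
        pattern[pattern.length - 1] === '/') {
        return new RegExp(pattern.slice(1, -1)).test(alias);
    }
    if (pattern.indexOf('*') !== -1) {
        return new RegExp(
            '^' + pattern.split('*').join('.*') + '$').test(alias);
    }
    return pattern === alias;
}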

Comment by Trent Mick
Created at 2017-09-25T23:42:46.000Z
Moved to TRITON because this affects both cloudapi (after PUBAPI-1428) and sdc-docker.

Comment by Marsell Kukuljevic
Created at 2019-01-09T23:07:10.070Z
Another ticket: https://github.com/joyent/sdc-designation/issues/2

Comment by Marsell Kukuljevic
Created at 2019-03-05T16:28:40.711Z
I forgot to update the ticket with its current status.

Comment by Trent Mick
Created at 2019-03-19T16:59:54.632Z
Marsell, is there a description somewhere of what the 5 CRs are doing to fix this issue? Is it an implementation of my "fixing this" comment above?

Comment by Marsell Kukuljevic
Created at 2019-03-19T18:11:39.135Z
Docker and CloudAPI both parse affinities as they used to, converting the result into an intermediate representation:

{
    "key": "instance" | "container" | <tag string>,
    "operator": "!=" | "==",
    "value": <string>,
    "valueType": "exact" | "glob" | "re",
    "isSoft": true | false
}

An array of the above objects is passed from docker/cloudapi to vmapi, which the workflow then sends through cnapi into dapi. dapi converts the affinities into locality hints, which a later dapi stage uses to filter candidate servers.
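
For example, the rule from the compose file in the description, "com.example.type!=jenkins-consul", should come through as something like this (my reading of the schema above, not captured output):

[
    {
        "key": "com.example.type",
        "operator": "!=",
        "value": "jenkins-consul",
        "valueType": "exact",
        "isSoft": false
    }
]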

An extra stage was also added to the workflow, between the dapi waitlist tickets, which updates the VM object in vmapi with a server_uuid (needed by other concurrent affinity/locality allocations). This update doesn't use etags though, so that might need to be changed.
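
For context, an etag-guarded version of that update would look roughly like this; the client methods and error name are hypothetical, since I don't know exactly what VMAPI's interface exposes:

// Hypothetical compare-and-set: retry instead of blindly overwriting
// server_uuid when the VM object changed underneath us.
async function setServerUuid(vmapi, vmUuid, serverUuid) {
    for (;;) {
        const { etag } = await vmapi.getVm(vmUuid);
        try {
            // Rejected unless the stored etag still matches.
            await vmapi.updateVm(vmUuid, { server_uuid: serverUuid },
                { etag: etag });
            return;
        } catch (err) {
            if (err.name !== 'EtagConflictError') {
                throw err;
            }
            // Someone else won the race; re-read and try again.
        }
    }
}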

Comment by Trent Mick
Created at 2019-03-19T18:47:21.858Z
@marsell @rui.loura Will these changes and those for TRITON-764 need to consider each other?

Comment by Rui Loura
Created at 2019-03-19T19:07:21.739Z
@trent.mick & @marsell I don't think the changes to sdc-vmapi:lib/workflows/provision.js will be necessary once I put back TRITON-764.

Comment by Jira Bot
Created at 2019-04-05T15:29:29.469Z
sdc-designation commit f2b4e8dac18b400a64fedb27db869b8d32ef009b (branch master, by Marsell Kukuljevic)

TRITON-9 affinity is racy
Reviewed by: Pedro Palazón Candel <pedro@joyent.com>
Approved by: Pedro Palazón Candel <pedro@joyent.com>


Comment by Jira Bot
Created at 2019-04-05T15:41:34.803Z
sdc-cnapi commit 72c1664f7b3fb896b23d0268467a7bac08cd8087 (branch master, by Marsell Kukuljevic)

TRITON-9 affinity is racy
Reviewed by: Pedro Palazón Candel <pedro@joyent.com>
Reviewed by: Orlando Vazquez <orlando@joyent.com>
Approved by: Pedro Palazón Candel <pedro@joyent.com>
Approved by: Orlando Vazquez <orlando@joyent.com>


Comment by Jira Bot
Created at 2019-04-05T15:42:41.505Z
sdc-cloudapi commit 0a07ee22a45ea51b8229f54ef998aa47eb9d3e66 (branch master, by Marsell Kukuljevic)

TRITON-9 affinity is racy
Reviewed by: Pedro Palazón Candel <pedro@joyent.com>
Approved by: Pedro Palazón Candel <pedro@joyent.com>


Comment by Jira Bot
Created at 2019-04-05T15:43:59.255Z
sdc-docker commit ca600b1d450606310a3bc05ea3573aacd0e6c826 (branch master, by Marsell Kukuljevic)

TRITON-9 affinity is racy
Reviewed by: Orlando Vazquez <orlando@joyent.com>
Reviewed by: Pedro Palazón Candel <pedro@joyent.com>
Approved by: Orlando Vazquez <orlando@joyent.com>