TRITON-9: affinity is racy

Description

It seems that if you use docker-compose scale container=number_of_instances, compose fires off the docker requests in parallel and reports each one as it completes. In the transcript below you can see the 5th container come up before the 4th.
ravio:sdc-docker112 $ docker-compose scale consul=3
Creating and starting sdcdocker112_consul_1 ... done
Creating and starting sdcdocker112_consul_2 ... done
Creating and starting sdcdocker112_consul_3 ... done
ravio:sdc-docker112 $ docker-compose scale consul=5
Creating and starting sdcdocker112_consul_4 ...
Creating and starting sdcdocker112_consul_5 ... done

When you have a compose file that looks something like the following, the affinity rules can be missed. In that scenario the containers end up on the same CN instead of being spread across CNs as the user intended.
consul:
    image: autopilotpattern/consul:latest
    restart: always
    mem_limit: 128m
    ports:
      - 8500
    dns:
      - 127.0.0.1
    labels:
      - triton.cns.services=jenkins-consul
      - com.example.type=jenkins-consul
      - com.docker.swarm.affinities=["com.example.type!=jenkins-consul"]
    command: >
      /usr/local/bin/containerpilot
      /bin/consul agent -server
        -bootstrap-expect 3
        -config-dir=/etc/consul
        -ui-dir /ui

In /lib/backends/sdc/containers.js, we can see that createContainer calls buildVmPayload, which in turn calls localityFromContainer (affinity.js). Since compose fires off multiple docker requests to sdc-docker in parallel, we likely have a race in the VMAPI calls that look up the relevant VMs.
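
To make the suspected race concrete, here is a minimal Node.js sketch of that check-then-act pattern. It is not the real sdc-docker code: listVmsByTag, provision and createContainer below are hypothetical stand-ins for the VMAPI lookup done by localityFromContainer and for the subsequent placement. Because every parallel request resolves the affinity tag against the same (still empty) set of VMs, they can all end up choosing the same CN.

'use strict';

const SERVERS = ['cn-a', 'cn-b', 'cn-c'];

// Stand-in for VMAPI's view of the world: VMs that already exist, with tags.
const placedVms = [];

// Hypothetical VMAPI lookup: resolve the VMs matching a tag after a short
// delay, the way a real HTTP round trip would.
function listVmsByTag(tag, value) {
    return new Promise(function (resolve) {
        setTimeout(function () {
            resolve(placedVms.filter(function (vm) {
                return vm.tags[tag] === value;
            }));
        }, 10);
    });
}

// Hypothetical placement: avoid the CNs occupied by the given VMs, then
// "provision" after another delay (where the real workflow would run).
async function provision(name, conflictingVms) {
    const avoid = new Set(conflictingVms.map(function (vm) {
        return vm.server;
    }));
    const server = SERVERS.find(function (s) {
        return !avoid.has(s);
    }) || SERVERS[0];
    await new Promise(function (resolve) { setTimeout(resolve, 10); });
    placedVms.push({
        name: name,
        server: server,
        tags: { 'com.example.type': 'jenkins-consul' }
    });
    return server;
}

// One docker "create" request: resolve the affinity tag to concrete VMs
// (roughly what localityFromContainer does), then provision with that snapshot.
async function createContainer(name) {
    const conflicting = await listVmsByTag('com.example.type', 'jenkins-consul');
    const server = await provision(name, conflicting);
    console.log('%s -> %s (saw %d conflicting VMs)', name, server,
        conflicting.length);
}

// Fired in parallel, as docker-compose scale does: every request sees an
// empty conflict set, so all of them land on the same CN (cn-a). Run them
// sequentially instead and they spread out.
Promise.all(['consul_1', 'consul_2', 'consul_3'].map(createContainer));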

There's been a report of this here: https://github.com/joyent/sdc-docker/issues/112

Comments

Comment by Michael Zeller
Created at 2017-04-20T03:33:55.000Z
I am going to try and reproduce this.

Comment by Michael Zeller
Created at 2017-04-20T03:50:31.000Z
Updated at 2017-04-20T03:53:07.000Z
In east-3b the last two containers ended up on the same CN.

ravio:sdc-docker112 $ docker-compose down; docker-compose scale consul=5; triton insts -H -o name,compute_node | sort | uniq -c | sort | grep _consul
Creating and starting sdcdocker112_consul_1 ... done
Creating and starting sdcdocker112_consul_2 ... done
Creating and starting sdcdocker112_consul_3 ... done
Creating and starting sdcdocker112_consul_4 ... done
Creating and starting sdcdocker112_consul_5 ... done
   1 sdcdocker112_consul_1     44454c4c-4400-1054-8052-b5c04f383432
   1 sdcdocker112_consul_2     44454c4c-5400-1034-804d-b5c04f383432
   1 sdcdocker112_consul_3     44454c4c-4400-1046-8050-b5c04f383432
   1 sdcdocker112_consul_4     44454c4c-4400-1043-8053-b5c04f383432
   1 sdcdocker112_consul_5     44454c4c-4400-1043-8053-b5c04f383432

Another run had similar results:
ravio:sdc-docker112 $ docker-compose down; docker-compose scale consul=5; triton insts -H -o name,compute_node | sort | uniq -c | sort | grep _consul
Stopping sdcdocker112_consul_1 ... done
Stopping sdcdocker112_consul_2 ... done
Stopping sdcdocker112_consul_5 ... done
Stopping sdcdocker112_consul_4 ... done
Stopping sdcdocker112_consul_3 ... done
Removing sdcdocker112_consul_1 ... done
Removing sdcdocker112_consul_2 ... done
Removing sdcdocker112_consul_5 ... done
Removing sdcdocker112_consul_4 ... done
Removing sdcdocker112_consul_3 ... done
Creating and starting sdcdocker112_consul_1 ... done
Creating and starting sdcdocker112_consul_2 ... done
Creating and starting sdcdocker112_consul_3 ... done
Creating and starting sdcdocker112_consul_4 ... done
Creating and starting sdcdocker112_consul_5 ... done
   1 sdcdocker112_consul_1     44454c4c-5400-1034-804d-b5c04f383432
   1 sdcdocker112_consul_2     44454c4c-4400-1043-8053-b5c04f383432
   1 sdcdocker112_consul_3     44454c4c-4400-1043-8053-b5c04f383432
   1 sdcdocker112_consul_4     44454c4c-5400-1034-804d-b5c04f383432
   1 sdcdocker112_consul_5     44454c4c-5400-1034-804d-b5c04f383432

Comment by jwilsdon#1
Created at 2017-04-20T06:59:28.000Z
Updated at 2017-04-20T07:06:26.000Z
My understanding is that the affinity rules are best-effort. Especially in east-3b this is really not surprising to me.

Would you have expected some of these provisions to have failed instead? We might be able to turn this into a hard filter such that two of these provisions would have failed instead of being co-located.

UPDATE: looking further at https://github.com/joyent/sdc-docker/blob/master/docs/api/features/placement.md it does seem as though those other two probably should have failed. I'll try to look into why they didn't. A race does indeed seem like it'd be possible here.

Comment by Michael Zeller
Created at 2017-04-20T14:27:45.000Z
Updated at 2017-04-20T14:29:13.000Z
"My understanding is that the affinity rules are best-effort."
That doesn't seem right. We support both hard and soft affinity rules: hard means that if the condition is not met we fail the provision, while soft is best-effort.
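
For reference, the two flavours look something like this in the swarm affinity expression syntax (illustrative values taken from the compose file above; the soft form is the one Trent mentions below that uses "~" in the operator):

// Illustrative affinity expressions only:
const hardRule = 'com.example.type!=jenkins-consul';   // hard: fail the provision rather than co-locate
const softRule = 'com.example.type!=~jenkins-consul';  // soft ("~"): best-effort, co-location allowed as a last resort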

The user who reported the issue on GitHub has a total of 4 CNs, and all of their provisions land on the same CN. However, if they provision the containers one at a time instead, the containers end up spread across the CNs as expected.

Comment by Trent Mick
Created at 2017-04-20T20:53:25.000Z
Updated at 2017-04-20T22:32:25.000Z
They aren't best-effort. That's only for the "soft" affinities using "~" in the operator.

@Smithx10 on the GH ticket said:

> After reading through ./lib/backends/sdc/affinity.js, I think we are only checking what is currently running and handing it off to DAPI.

They are correct. The translation from (a) a tag (which the swarm affinity rule syntax supports, e.g. "role=web") and (b) a VM UUID pattern into concrete VM UUIDs (which are all that current DAPI locality rules support) is done in the sdc-docker engine, before we get to the processing in the provision workflow that serializes server selection (https://github.com/joyent/sdc-vmapi/blob/e4cfd962384a45038ad823ab197ca9b375d7d94b/lib/workflows/provision.js#L523-L541).
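
As a rough illustration of that translation (the payload shape is simplified here, not copied verbatim from the sdc-docker source, though VMAPI locality hints do take strict/near/far lists of VM UUIDs), the engine turns the tag expression into a UUID-only locality hint using whatever matching VMs it can see at that instant:

// Rough illustration only; field names simplified.

// What the user wrote (swarm affinity syntax, by tag):
const affinityExpr = 'com.example.type!=jenkins-consul';

// What sdc-docker hands onward after resolving the tag to VM UUIDs:
const locality = {
    strict: true,                 // "!=" is a hard rule ("!=~" would be soft)
    far: [
        '<uuid of an existing jenkins-consul VM>'
        // ...empty if the racing provisions haven't been created yet,
        // in which case there is no constraint at all.
    ]
};

If several requests get translated before any of their VMs exist, each request's far list is empty, so the serialized server selection in the provision workflow has nothing to enforce.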

Affinity via CloudAPI and the `triton` CLI isn't affected, because those paths don't yet support affinity via tag or via an instance name glob.

Fixing this

The only good fix I see here is to update DAPI's locality hints processing to support the tag syntax and the VM name pattern syntax (e.g. "container==foo*", "container==/^web/"). We could then also consider having DAPI support the full affinity syntax, i.e. also supporting "container=(short docker id)".


That might be interesting because of DAPI's recent batching of candidate servers. We'd perhaps want DAPI to first hit VMAPI to gather relevant information for tag affinities... that might reduce the server consideration set.
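
A sketch of what passing the unresolved rules through to DAPI might look like (purely hypothetical shape; no such API exists in DAPI today):

// Hypothetical payload shape for the proposed fix -- not an existing DAPI API.
const locality = {
    strict: true,
    far: [
        { tag: 'com.example.type', value: 'jenkins-consul' },  // tag affinity
        { name: '/^web/' }                                      // VM name pattern
    ]
};
// DAPI would resolve these rules against VMAPI itself, inside the serialized
// part of the provision workflow, instead of relying on a UUID list that
// sdc-docker computed earlier (and may have raced on).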

Comment by Trent Mick
Created at 2017-09-25T23:42:46.000Z
Moved to TRITON because this affects both cloudapi (after PUBAPI-1428) and sdc-docker.