If you want basic resiliency for your DC/OS master nodes when hosting them on AWS, have Exhibitor store its data in AWS S3. To do that, grant your master nodes an IAM role that allows access to a specific S3 bucket. Then use this genconf/config.yaml
config to install (or re-install) your DC/OS cluster:
exhibitor_storage_backend: aws_s3
aws_region: us-east-1
exhibitor_explicit_keys: false
s3_bucket: <bucket-name>
s3_prefix: my-dcos-exhibitor-file
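Before you install, it's worth sanity-checking that the IAM role actually works. Here's a sketch of how you might do that from one of the master nodes, assuming the AWS CLI is installed there; the bucket name is the placeholder from the config above, and iam-write-test is just a throwaway object name I made up:
# confirm the instance role can list the bucket
aws s3 ls s3://<bucket-name>/ --region us-east-1
# confirm it can write and delete objects, then clean up
aws s3 cp /etc/hostname s3://<bucket-name>/iam-write-test --region us-east-1
aws s3 rm s3://<bucket-name>/iam-write-test --region us-east-1
If either command is denied, fix the role or bucket policy before running the installer.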
Note: You do NOT need to put your master nodes behind a load balancer (master_discovery: master_http_loadbalancer
) for discovery, even if you decide to use S3 for the Exhibitor backend. Yes, the two are often used together, but it's not mandatory.
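For example, a config.yaml along these lines (a sketch, using the same master IPs shown later in this post) keeps static discovery while still pointing Exhibitor at S3:
master_discovery: static
master_list:
  - 10.0.2.34
  - 10.0.0.147
  - 10.0.4.247
The exhibitor_storage_backend settings from the snippet above stay exactly the same; only the discovery mechanism differs from the load-balancer setup.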
Back to the installation: follow the instructions here to complete it. Once everything is deployed, head to your S3 bucket and open up the my-dcos-exhibitor-file
. Here, one of two things can happen:
- You find the file doesn't exist. In this case, take a look at your
genconf/config.yaml
file and count the number of master nodes you listed there. Here's my list of masters:
master_list:
  - 10.0.2.34
  - 10.0.0.147
  - 10.0.4.247
If you have fewer than 3 master nodes in your list, Exhibitor defaults to the "static" backend and won't use S3 to store its config. So, use 3 or more master nodes and reinstall.
- The second issue that can happen is that only one of the master nodes succeeds in writing to the
my-dcos-exhibitor-file
in S3, and you're left with a broken cluster. Your services (# systemctl | grep dcos
) will all fail, and your postflight check will time out and fail (# sudo bash dcos_generate_config.sh --postflight
). You may also see tons of "null-lock-*
" files hanging out in your S3 bucket. If this is your case, go check out the
my-dcos-exhibitor-file
from S3. If you see something like this, the workaround below may help:
#Auto-generated by Exhibitor 10.0.0.163
#Wed Dec 14 19:49:59 UTC 2016
com.netflix.exhibitor-rolling-hostnames=
com.netflix.exhibitor-rolling.zookeeper-data-directory=/var/lib/dcos/exhibitor/zookeeper/snapshot
com.netflix.exhibitor-rolling.servers-spec=2\:10.0.0.163
com.netflix.exhibitor.zookeeper-pid-path=/var/lib/dcos/exhibitor/zk.pid
com.netflix.exhibitor.java-environment=
com.netflix.exhibitor.zookeeper-data-directory=/var/lib/dcos/exhibitor/zookeeper/snapshot
com.netflix.exhibitor-rolling-hostnames-index=0
com.netflix.exhibitor-rolling.java-environment=
com.netflix.exhibitor-rolling.observer-threshold=0
com.netflix.exhibitor.servers-spec=2\:10.0.0.163
com.netflix.exhibitor.cleanup-period-ms=300000
com.netflix.exhibitor.zookeeper-config-directory=/var/lib/dcos/exhibitor/conf
com.netflix.exhibitor.auto-manage-instances-fixed-ensemble-size=3
com.netflix.exhibitor.zookeeper-install-directory=/opt/mesosphere/active/exhibitor/usr/zookeeper
com.netflix.exhibitor.check-ms=30000
com.netflix.exhibitor.zookeeper-log-directory=/var/lib....
Look at the two servers-spec lines (com.netflix.exhibitor-rolling.servers-spec and com.netflix.exhibitor.servers-spec): only one master made it in. What happened to your other master nodes, you ask? Well, I don't have an answer, but there's a workaround.
Edit those two lines to include all of your servers, making sure to give each one an id:
com.netflix.exhibitor-rolling.servers-spec=2\:10.0.0.163,1\:10.0.4.50,3\:10.0.2.174
com.netflix.exhibitor.servers-spec=2\:10.0.0.163,1\:10.0.4.50,3\:10.0.2.174
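Since the file lives in S3, one way to make the edit (a sketch, assuming the AWS CLI is available and the bucket name is the placeholder from the config above) is to pull it down, edit the two servers-spec lines locally, and push it back:
# pull the current Exhibitor shared config down from S3
aws s3 cp s3://<bucket-name>/my-dcos-exhibitor-file ./my-dcos-exhibitor-file
# edit the two servers-spec lines to list every master, then upload the result
aws s3 cp ./my-dcos-exhibitor-file s3://<bucket-name>/my-dcos-exhibitor-file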
Now, give your entire cluster a few minutes while all the master nodes stop being asshats and start discovering each other. Once Exhibitor is happy, DC/OS stops being whiny: all your services will come up, and you'll soon be able to log in to your DC/OS UI.
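To watch the recovery, you can re-run the same checks from earlier on a master node, and ask Exhibitor itself how the ensemble looks (the status endpoint below is what I'd reach for, but treat it as an assumption; the Exhibitor UI on port 8181 of a master is the surer bet):
# all dcos-* units should eventually be active
systemctl | grep dcos
# ask Exhibitor for the ensemble view; every master should report "serving"
curl -s http://localhost:8181/exhibitor/v1/cluster/status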
I hope that was helpful. I wasted an entire day (well, I got paid to do it) trying to figure this out.