DC/OS Exhibitor on S3 – Issues & Workarounds

If you want basic resiliency around your DC/OS master nodes when hosting them on AWS, you’ll want to have Exhibitor store its data in AWS S3. In order to do so, you’ll want to grant S3 IAM roles to your master nodes so they may talk to a specific S3 bucket. Then, you use this genconf/config.yaml config to install (or re-install) your DC/OS cluster:

exhibitor_storage_backend: aws_s3
aws_region: us-east-1
exhibitor_explicit_keys: false
s3_bucket: <bucket-name>
s3_prefix: my-dcos-exhibitor-file

Note: You do NOT need to have your master nodes use a load balancer (master_discovery: master_http_loadbalancer ) for discovery¬†even if you decide to use S3 for exhibitor backend. Yes, they’re often used together, but it’s not mandatory to use them together.

Back to your installation, assuming you follow instructions here to continue the installation. Once everything is deployed, head to your S3 bucket, search for and open up the my-dcos-exhibitor-file. Here, two things can happen:

  1. You find the file doesn’t exist. In this case, take a look at your genconf/config.yaml file and count the number of master nodes that you listed in there. Here’s my list of masters:

    If you have less than 3 master nodes in your list, then Exhibitor defaults to using a “static” exhibitor backend, and won’t use S3 to store its config. So, use 3 or more nodes and reinstall.

  2. The second issue that can happen is that only one of the master nodes succeeds to write to S3 into the my-dcos-exhibitor-file and you’re left with a broken cluster. Your services ( #systemctl | grep dcos ) will all fail and your postflight will timeout and fail( #sudo bash dcos_generate_config.sh --postflight ). You may also see tons of “null-lock-*” files hanging out in your S3 bucket:screenshot-at-2016-12-14-154748

    If this is your case, go checkout the my-dcos-exhibitor-file from S3. If you see something like this, there may be something I can do to help:

    #Auto-generated by Exhibitor
    #Wed Dec 14 19:49:59 UTC 2016

    If you look at the highlighted lines, you may notice something. What happened to your other master nodes, you ask? Well, I don’t have an answer, but there’s a workaround.

    Edit those two (highlighted in red) lines to include all your servers. Make sure to give them an id.


    Now, give your entire cluster a few minutes while all master nodes stop being asshats and start to discover each other. Once Exhibitor is happy, DC/OS stops being whiny. All your services will be up, and you’ll soon be able to login to your DC/OS UI.

    I hope that was helpful. I wasted an entire day (well I got paid to do it) trying to figure this out.

This entry was posted in Amazon Web Services, Linux, Tech. and tagged , , , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s