Work from home, is it working?

04 Mar 2017

The theme of the week has been geography. I have been listening to Startup CEO, in which Matt Blumberg argues strongly that keeping a central office for as long as you can is important for human connection, serendipitous conversations and culture. This is in contrast to virtual or home working. How connected are we to physical geography while also being connected to a virtual world?

(Read more...)

Do you have Fun at Work?

25 Feb 2017

What sort of question is that? I don’t know about you, but I feel privileged and happy that I enjoy what I do at work, and would even go as far as to call it fun. I want to be remembered for making work fun for everyone.

(Read more...)

Security Groups

18 Feb 2017

I ran into a couple of problems this week involving security groups. A security group acts like a virtual firewall for your instance. It controls what traffic enters and leaves, and is attached to an instance when it starts.

(Read more...)

Getting authenticated with Mongo

11 Feb 2017

The challenge this week was to find out why authentication appeared to be broken on the automated MongoDB build. Several weeks ago I had written a Puppet module to build a MongoDB cluster from a number of arguments, such as the number of nodes, node names, certificates, etc. Despite having certificates generated from a CA (Certificate Authority), and the client presenting its certificate to log on, this user could do anything, and .auth() was not needed.

mongo admin --ssl --sslCAFile /etc/mongodb/ssl/mongoCA.pem \
    --sslPEMKeyFile /etc/mongodb/ssl/mongo1.pem \
    -u mongoReadony -p mongotest --host mongo1

In the /etc/mongod.conf file, security.clusterAuthMode: x509 was set, but security.authorization was disabled. It was assumed that specifying net.ssl.mode was enough and that the security.authorization setting would be ignored. Sorry, false assumption.
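As a rough sketch of the relevant part of mongod.conf with access control switched on: the option names are standard mongod settings, but the net.ssl.mode value and the reuse of the certificate paths from the client command are assumptions for illustration, not the output of the original Puppet module.

```yaml
# /etc/mongod.conf (fragment): illustrative values only
net:
  ssl:
    mode: requireSSL                          # assumed value; the post only names net.ssl.mode
    PEMKeyFile: /etc/mongodb/ssl/mongo1.pem   # assumed: path borrowed from the client command
    CAFile: /etc/mongodb/ssl/mongoCA.pem      # assumed: path borrowed from the client command
security:
  clusterAuthMode: x509     # authentication between cluster members, as set in the post
  authorization: enabled    # enforces access control for client users
```

The point of the sketch is that the two security settings do different jobs: clusterAuthMode covers how members of the cluster authenticate to each other, while authorization decides whether client users are restricted to the roles they have been granted.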

(Read more...)

The Importance of testing backups

02 Feb 2017

Another incident of a tired admin fixing an outage only to cause a bigger one isn't news as such. However, I have to hand it to GitLab for their open honesty about this week's incident.

After a spam storm created serious (4GB) replication lag on the firm's PostgreSQL database cluster, a very tired on-call team member tried to fix the replication and deleted the data folder on the active server rather than the replicating one.

The full incident is documented here

I embrace the honesty that they have shown, as this enables the whole community to learn from the incident and offer better services to our clients. This is very much the message in Black Box Thinking by Matthew Syed. Matthew describes the difference between closed cultures, where mistakes are hidden, and an open, honest culture, where mistakes are out in the open and much learning and prevention occurs as a result.

As shown by the support on Twitter, the DevOps and cloud reliability engineers agree.

Lessons so far? Test your backups; you never know when you will really need them.

With my ethos of servers being disposable, I love destroying and rebuilding servers to prove that the service can be restored in any Disaster Recovery situation. This relies on well-designed recovery processes and code, and shifts the focus from avoiding failure to embracing failure and reducing the mean time to recovery.

(Read more...)