Managing the Unexpected
Darrell Mozingo | Books | September 4th, 2013

I recently read Managing the Unexpected. It’s a brilliant book about running highly resilient organisations. While it’s mostly based on high-risk organisations like nuclear power plants and wildfire firefighting units, it’s still highly applicable to any company trying to increase its resilience to failures and outages.

A lot of the points in the book fall into that “sounds so obvious” category once you’ve read them, but I think those are the best kind, as they crystallise ideas you couldn’t quite articulate yourself and give you a good way to communicate them to your colleagues. There’s still plenty in there to give you something new to think about, too. The first half of the book discusses five principles the authors feel all highly resilient organisations need to follow, while the second half goes over ways to introduce them to your organisation, complete with rating systems for assessing how you function now.

The five main principles the book harps on are (the first three are for avoiding incidents, while the last two are for dealing with them when they occur):

  • Tracking small failures – don’t let errors slip through the cracks and go unnoticed.
  • Resisting oversimplification – don’t simply write off errors as “looking like the same one we see all the time”, but investigate them.
  • Remaining sensitive to operations – employees working on the front line are more likely to notice something out of the ordinary, which could indicate an impending failure. Listen to them.
  • Maintaining capabilities for resilience – shy away from removing things that’ll keep resilience in your system when there’s an outage.
  • Taking advantage of shifting locations of expertise – don’t leave all decision making power in the hands of managers that may be separated from the incident. Let front line members call the shots.

Here’s some of my favourite bits of wisdom from the book:

  • “… try to hold on to those feelings and resist the temptation to gloss over what has just happened and treat it as normal. In that brief interval between surprise and successful normalizing lies one of your few opportunities to discover what you don’t know. This is one of those rare moments when you can significantly improve your understanding. If you wait too long, normalizing will take over, and you’ll be convinced that there is nothing to learn.” (pg 31) There have been too many times in the past where I’ve been involved in system outages and everyone goes into panic mode, gets the problem solved, but then sits around afterwards going “yeah, it was just because of that usual x or y issue that we know about”. It’s about digging in and never assuming a failure was because of a known situation (lying to yourself). Dig in and find out what happened with a blank slate after each failure. Keep asking why.
  • “Before an event occurs, write down what you think will happen. Be specific. Seal the list in an envelope, and set it aside. After the event is over, reread your list and assess where you were right and wrong.” (pg 49) Basically the scientific method: set up a hypothesis with expectations you can check after an event (software upgrade, new feature, added capacity, etc.). It’s definitely not something I’m used to, but I’m trying to build it into my workflow. I love the idea of Etsy’s Catapult tool, where they set up expectations for error rates, client retention, etc. before releasing a feature, then do A/B testing to show whether it met or failed each criterion.
  • “Resilience is a form of control. ‘A system is in control if it is able to minimize or eliminate unwanted variability, either in its own performance, in the environment, or in both… The fundamental characteristic of a resilient organization is that it does not lose control of what it does but is able to continue and rebound.’” (pg 70) – Don’t build highly resilient applications assuming they’ll never break, but instead assume that each and every piece will break or slow down at some point (even multiple together) and design your app to deal with it. We’ve built our streaming platform to assume everything will break, even our dependencies on other internal teams, and we’ll just keep going as best we can when they’re down and bounce back after.
  • “Every unexpected event has some resemblance to previous events and some novelty relative to previous events. [...] The resilient system bears the marks of its dealings with the unexpected not in the form of more elaborate defences but in the form of more elaborate response capabilities.” (pg 72) – When you have an outage and determine the root cause, don’t focus on stopping that one specific error from ever happening again. Instead, try to build resilience into the system to stop that whole class of problem from having an effect in the future. If your cache throwing a specific error was the root cause, for instance, build the system to handle any error from the cache rather than that specific one, and add metrics around these failures to respond faster in the future.
  • “Clarify what constitutes good news. Is no news good news, or is no news bad news? Don’t let this remain a question. Remember, no news can mean either that things are going well or that someone is [...] unable to give news, which is bad news. Don’t fiddle with this one. No news is bad news.” (pg 152) – If your alerting system hasn’t made a peep for a few days, it’s probably a bad thing. Some nominal level of errors is always present, and if you’re hearing nothing, the silence itself is the error. Never assume your monitoring and alerting systems are working smoothly!
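The cache point above can be sketched in a few lines of JavaScript. This is only an illustration of the principle – `getWithFallback`, `cache.get`, and the metric name are all made up for the example:

```javascript
// Hypothetical sketch: treat *any* cache failure the same way -- fall back
// to the real source and record a metric -- rather than special-casing the
// one error that caused the last outage.
function getWithFallback(cache, fetchFromSource, key, metrics) {
	try {
		var cached = cache.get(key); // any throw here is handled identically
		if (cached !== undefined) {
			return cached;
		}
	} catch (err) {
		metrics.increment("cache.errors"); // count the whole class of failures
	}
	return fetchFromSource(key); // degrade gracefully to the slow path
}
```

The point isn’t the code itself, but that nothing here mentions the specific exception from the last post-mortem.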

Overall the book is an excellent read. A bit dense in writing style at times, but I’d recommend it if you’re working on a complex system that demands uptime in the face of shifting requirements and operating conditions.

DevOps Days London 2013
Darrell Mozingo | Events | March 18th, 2013
I spent this past Friday & Saturday at DevOpsDays London. There have been a few reviews written already about various bits (and a nice collection of resources by my co-worker Anna), and I wanted to throw my thoughts out there too. The talks each morning were all very good and well presented, but the real meat of the event for me was the 3 tracks of Open Spaces each afternoon, along with the various break time and hallway discussions. I didn’t take notes as detailed as others did, but here are the bits I took away from each Open Space:
  • Monitoring: Discussed using Zabbix, continuous monitoring, and some companies trying out self-healing techniques with limited success (be careful with services flapping off and on).
  • Logstash: Windows client support (not as good as it sounds), architecture (ZeroMQ everything to one or two servers, then on to Elasticsearch), and what to log (everything!).
  • Configuration Management 101 (w/Puppet & Chef): It was great having the guys from PuppetLabs and Opscode here to give views on both products (and trade some friendly jabs!). Good discussion about Windows support, including a daily growing community with package support and the real possibility of actually doing config management on Windows. We’re using CFEngine, and while I got crickets after bringing it up, a few people were able to offer some good advice and compare it with Puppet & Chef (stops on error like Chef, good for legacy support, promise support is nice, etc.).
  • Op to dev feedback cycle: Besides the usual “put devs on call” idea (which I still feel is a bad idea), there was discussion about getting bugs like memory leaks prioritised above features. One of the better suggestions to me was simply going and talking to the devs, putting faces to names and getting to know one another. Suggestions were also made for ops to just patch the code themselves, which throws up a lot of alarms to me (going through back channels, perhaps not properly tested, etc.). I say make a pull request.
  • Deployment orchestration: Bittorrent for massive deploys (Twitter’s Murder), Jenkins/TeamCity/et al are still best for kicking off deploys, and MCollective for orchestration.
  • Ops user stories: Creating user stories for op project prioritisation is hard, as is fitting the work in for sprints. Ended up coming down to standard estimation difficulties – more work popping up, unknown unknowns, etc. Left a bit before the end to pop into a Biz & DevOps Open Space, but didn’t get much from it before it ended.
Overall it was a great conference. Well planned, good food, and great discussions. Nothing completely groundbreaking, but a lot of really good tips & recommendations to dig into.
Software Craftsmanship 2012
Darrell Mozingo | Events | June 24th, 2012

I attended the Software Craftsmanship 2012 conference last Thursday up at Bletchley Park. It was an awesome event run mostly by Jason Gorman and the staff at the park. The company I work for, 7digital, sponsored the event so all ticket proceeds went directly to help the park, which is very cool. They’re in desperate need of funding and this event has brought in a hefty amount over the past few years.

I did the Pathfinding Peril track in the morning. They went over basic pathfinding algorithms, including brute force and A*, and their applicability outside the gaming world. The rest of the session was spent pairing on bots that compete against other bots, trying to automatically navigate a maze the fastest (using this open source tournament server). Unfortunately they didn’t have Mono installed, so my pair and I wasted some time getting NetBeans installed and a basic Java app up and running. Very interesting, and it spurred a co-worker to set up a tournament server at work too. Looking forward to submitting a bot there to try out some pathfinding algorithms.

During our lunch break they gave a nice, albeit quick, tour of the park. We got to see the main sites, including Colossus. Very interesting stuff, and amazing to hear how they pulled off all those decoding and computational feats during the war.

For the afternoon I went to the Team Dojo session. We were told to write our strongest languages on name badges, then break off into teams of 4-6 based on that. I got together with a group of 6 devs, including some co-workers. After a brief overview of the Google PageRank algorithm and a generic nearest-neighbour one, we were set loose to create a developer-centric LinkedIn clone from a complete standing start. We had to figure out where to host our code, how to integrate, code the algorithms, parse in XML data, and throw it all up on the screen somehow, in around 2 hours. Unfortunately we spent way too much time shaving yaks, as it were, with testing and our CI environment, and didn’t get to the algorithms until the end (although we were close to finishing them!). Learned a bit about trying to jump start a project like that with different personalities and making it all mesh together. It’d be interesting to see how we’d all do it again, especially since katas are meant to be repeated.

Between the talks, lunch, hog roast dinner, tour, and the great little side discussions had between it all, it was an excellent event (although they could try doing something about those beer prices!). Everyone did a great job putting it on. Here’s a video of the day Jason put together (I’m one of the last pair of interviews during our afternoon session). I’m quite looking forward to attending it again in the future.

Continuous Delivery
Darrell Mozingo | Build Management | December 30th, 2011

I recently finished reading Continuous Delivery. It’s an excellent book that manages to straddle that “keep it broad to help lots of people yet specific enough to actually give value” line pretty well. It covers testing strategies, process management, deployment strategies, and more.

At my former job we had a PowerShell script that would handle our deployment and related tasks. Each type of build – commit, nightly, push, etc. – worked off its own artifacts that it created right then, duplicating any compilation, testing, or pre-compiling tasks. That eats up a lot of time. Here’s a list of posts where I covered how that script generally works:

The book talks about creating a single set of artifacts from the first commit build, and passing those same artifacts through the pipeline of UI tests, acceptance tests, manual testing, and finally deployment. I really like that idea, as it cuts down on unnecessary rework, and gives you more confidence that this one set of artifacts is truly ready to go live. Sure, our old tasks could call the same function to compile the source or run unit tests, so it was effectively the same, but there could have been slight differences where the assemblies produced from the commit build were slightly different than those in the push build.

I also like how they mention getting automation in your project from day one if you’re lucky enough to work on a green-field app. I’ve worked on production deployment scripts for legacy apps and for ones that weren’t in production yet but were already a year or so old. The newer an app is and the less baggage it has, the easier it is to get started, and getting started is the hardest part. Once you have a script that just compiles and copies files, you’re 90% of the way there. You can tweak things and add rollback functionality later, but the meat of what’s needed is there.

However you slice it, you have to automate your deployments. If you’re still copying files out by hand, you’re flat out doing it wrong. In the age of PowerShell, there’s really no excuse to not automate your line of business app deployment. The faster deliveries, more transparency, and increased confidence that automation gives you can only lead to one place: the pit of success, and that’s a good place to be.

Moving on
Darrell Mozingo | Misc. | November 14th, 2011

I’ve been at Synergy Data Systems for over 7 years now (I know, the site is horrible). I’ve worked with a lot of great people on some very interesting projects, and learned a boatload during that time. Unfortunately, they can’t offer the one thing my wife and I wanted: living abroad.

To that end, we’re moving to London and I’ll be starting at 7digital in early January. I’m super excited about both moves. 7digital seems like a great company working with a lot of principles and practices that are near and dear to me, and c’mon, it’s London. For two people that grew up in small town Ohio, this’ll be quite the adventure!

I’m looking forward to getting involved in the huge developer community over there, playing with new technologies, and working with fellow craftsmen!

Sep 29

UPDATE: See Paul’s comment below – sounds like the latest cygwin upgrade process isn’t as easy as it used to be.

If you install GitExtensions, up through the current 2.24 version (which comes bundled with the latest msysgit version 1.7.6-preview20110708), and use OpenSSH for your authentication (as opposed to Plink), you’ll likely notice some painfully slow cloning speeds. Like 1MB/sec on a 100Mb network kinda slow.

Thankfully, it’s a pretty easy fix. Apparently msysgit still comes bundled with an ancient version of OpenSSH:

$ ssh -V
OpenSSH_4.6p1, OpenSSL 0.9.8e 23 Feb 2007

Until they get it updated, it’s easy to do yourself. Simply install the latest version of Cygwin, and make sure to search for and install OpenSSH on the package screen. Then go into the /bin directory of where you installed Cygwin, and copy the following files into C:\Program Files\Git\bin (or Program Files (x86) if you’re on 64-bit):

  • cygcrypto-0.9.8.dll
  • cyggcc_s-1.dll
  • cygssp-0.dll
  • cygwin1.dll
  • cygz.dll
  • ssh.exe
  • ssh-add.exe
  • ssh-agent.exe
  • ssh-keygen.exe
  • ssh-keyscan.exe

Checking the OpenSSH version should yield something a bit higher now:

$ ssh -V
OpenSSH_5.8p1, OpenSSL 0.9.8r 8 Feb 2011

Your clone speeds should be faster too. This upgrade bumped ours from literally around 1MB/sec to a bit over 10MB/sec. Nice.

Getting started with TDD
Darrell Mozingo | Musings, Testing | September 15th, 2011

When I first read about TDD and saw all the super simple examples that litter the inter-tubes, like the calculator that does nothing but add and subtract, I thought the whole thing was pretty stupid and its approach to development was too naive. Thankfully I didn’t write the practice off – I started trying it, plugging away here and there. One thing I eventually figured out was that TDD is a lot like math. You start out easy (addition/subtraction), and continue building on those fundamentals as you get used to it.

So my suggestion to those starting down the TDD path is: don’t brush it off. Start simple. Do the simple calculator, the stack, or the bowling game. Don’t start thinking about how to mix in databases, UI’s, web servers, and all that other crud with the tests. Yes, these examples are easy, and yes they ignore a lot of stuff you need to use in your daily job, but that’s sort of the point. They’ll seem weird and contrived at first, but that’s OK. It serves a very real purpose. TDD has been around for a good while now, it’s not some fad that’s going away. People use it and get real value out of it.

The basic practice examples get you used to the TDD flow – red, green, refactor. That’s the whole point of things like katas: convert that flow into muscle memory. Get it ingrained in your brain, so when you start learning the more advanced practices (DIP, IoC containers, mocking, etc), you’ll just be building on that same basic flow. Write a failing test, make it pass, clean up. You don’t want to abandon that once you start learning more and going faster.

It seems everyone gets the red-green-refactor part down when they’re doing the simple examples, but forget it once they start working on production code. Sure, you don’t always know what your code is going to do or look like, but that’s why we have the tests. If you can’t even begin to imagine how your tests will work, write some throw away spike code. Get it working functionally, then delete it all and start again using TDD. You’ll be surprised how it changes.

Good luck with your journey. If you’re in the Canton area, don’t forget to check out the monthly Canton Software Craftsmanship meetup. There are experienced people there that are eager to help you out.

Commenting out old code kills puppies
Darrell Mozingo | Musings | July 28th, 2011

There, I said it. Actually, I’m kind of worried that title won’t adequately state the intensity of this situation.

This is one of the fundamental reasons we have source control, people: so we can go back through a file’s history and see the different revisions. Please, for the love of all that is holy, don’t comment out old code. Just delete it! Feel free to slap your own knuckles with a ruler if you start to think about commenting it. Don’t try to recreate a source control system through commented out code. Everyone knows exactly what I’m talking about:

// John Doe - 7/5/2011 - Changed to allow a higher limit.
// dozens of lines of old code....
 
// John Doe - 7/18/2011 - Changed algorithm slightly.
// dozens of lines of old code....
 
// random dozen lines of old code with no comment at all
 
public void ActualCode() { }

Those extra comment chunks are just crap to sift through to get to the real code, extra stuff you’ll have to parse to see if it’s relevant to the current situation, and more false positives for ReSharper (and I’m guessing other refactoring tools) to pick up when you rename a variable/method that’s used inside those commented chunks. That chunk of old code at the bottom without even a hint as to why it’s commented out? That’s the worst of the worst – someone’s going to sit there and stare at it for a good while before they figure out why it was commented out, and odds are when the author committed this file with that code commented out, the commit message was blank too. Awesome.

So anyway, just remember what actually happens the next time you’re about to comment out old code and don’t do it. You’ll be doing future programmers (and more than likely yourself) a huge service…

Commenting code kills puppies
Consistent modal dialogs, the easy way
Darrell Mozingo | Web | July 20th, 2011

So we all know the default alert dialog box visually sucks. Any of the hundreds of jQuery modal plugins work wonderfully for replacing it with something a bit snazzier (although putting the information on the page for a user is even better, but that’s for another post). The biggest problem with most of those dialogs is either the setup cost or the memory cost:

  • Setup cost: having to set heights, widths, button names, text & title fields, yada, yada, yada. A lot of that can be skinned through CSS, and a lot of plugins reduce that noise to virtually nil, but many leave a lot on your pages. It’s ugly to look at in your code, and ugly to configure. Not to mention all those config settings spread through your code base like the freakin’ ground ivy is spreading through my lawn as I type this. Want to change the widths for a new redesign, or localize the button names? Good luck!
  • Memory cost: relates to the Pit of Success. Do you really want the burden of always remembering to use that modal dialog instead of alert? What about the new guy, is he going to know or remember? Sure, forgetting isn’t that big of a deal, but given enough slip-ups your nice consistent UI goes to hell. Tests checking for calls to alert are also possible, via straight searching through files or through UI tests somehow, but I can see a future of false positives ahead of that idea.

How about a better way? With some very slight JavaScript-foo, you can override the default alert and confirm dialogs so not only is there nothing to copy & paste between pages, but you don’t even have to remember to use your nifty modal boxes – it’ll just happen. We’ll use the jQuery UI Dialog plugin inside a stock ASP.NET MVC app, though this is easily transferable to any other platform or any other modal plugin.

First we’ll override the default alert method on the window object, calling the dialog function from jQuery UI and setting some default parameters:

window.alert = function (message) {
	$("#dialog")
	.html("")
	.html('<span class="ui-icon ui-icon-alert custom-ui-icon"></span>' + message)
	.dialog({
		autoOpen: true,
		resizable: false,
		height: 200,
		width: 350,
		title: "Alert!",
		modal: true,
		buttons: {
			"OK": function () {
				$(this).dialog("close");
				return;
			}
		}
	});
};
 
alert("Error!!!");

The HTML is cleared out and a default alert icon (from jQueryUI) is added via the class attribute ui-icon-alert. This allows us to create a standard <div id="dialog"></div> in our master page with nothing inside it, and reuse it for alert/confirm/prompt boxes. Then a standard alert call, like the one at the bottom, gives us:

Alert modal dialog vs Default alert

Similarly, we can override the default confirmation box. Here’s a version that’ll take the title, a message to show, and a callback function to execute if the user clicks “OK”:

window.confirm = function(title, confirmMessage, successCallback) {
	$("#dialog")
		.html("")
		.html('<span class="ui-icon ui-icon-help custom-ui-icon"></span>' + confirmMessage)
		.dialog({
				autoOpen: true,
				resizable: false,
				height: 200,
				width: 350,
				title: title,
				modal: true,
				buttons: {
					"Yes": function() {
						$(this).dialog("close");
						successCallback();
						return;
					},
					"No": function() {
						$(this).dialog("close");
						return;
					}
				}
			});
};
 
confirm("Are you sure?", "Are you sure you want to create a confirm?", function() { alert("Sweet, all done!"); });

Again, compare the results (the first question mark is an icon from jQuery UI, which can also be changed with the class written out in the confirm method above):

Modal Confirm vs Default Confirm

Pretty neat, if you ask me. This whole thing is very DRY, as everything you need is referenced in your master page (the dialog div, the JavaScript & CSS files, etc.) and your individual pages don’t need to include anything – just call away. You also don’t have to remember to call special methods (or at least terribly special ones in confirm‘s case). It just works.

It’s not too hard to imagine extending this system to override the default prompt box either. Just pass in a callback that’ll set whatever string you need, similar to how confirm works above. Closures work wonders.
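The closure that paragraph alludes to can be sketched without any of the dialog plumbing. Here `readInput` stands in for reading the dialog’s text field and `resultCallback` for whatever the caller wants done with the string – both names are invented for the example:

```javascript
// Build the "OK" handler for an overridden prompt: the closure captures
// both where the text comes from and where it should go.
function makePromptOkHandler(readInput, resultCallback) {
	return function onOk() {
		resultCallback(readInput());
	};
}
```

The jQuery UI wiring would then mirror the confirm override above, with this handler attached to the OK button and `readInput` reading the input element inside the dialog div.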

You can grab the code used in this post right here.

The Pit of Success
Darrell Mozingo | Design Principles | June 26th, 2011
Pit of Success

I’m a huge believer in the Pit of Success. Quite a few have written about it before, though not always in development terms. Put simply, there are two pits you can create in your application through conventions, infrastructure, culture, tools, etc.: success and failure. I obviously choose the former.

The Pit of Success is when you and the other developers on your team have to think less about the mundane stuff because there’s only one easy development path to follow. Less thinking about crap = more thinking about business problems = faster software with fewer bugs. In general, if I see something that’s going to be in a lot of classes/pages and has a decent bit of setup and baggage to it, I instantly picture another developer forgetting to bring all that along when they start new features or refactor. If things break (visually or programmatically) when that happens, there’s a problem. Same goes for huge chunks of documentation explaining how to use a certain feature elsewhere in the system – time to make it easier to use! Here are a few examples of how we’ve dug out a Pit of Success on our current project:

  • Need to create a new scheduled task? Drop in a class and implement a simple interface. The same principle goes for a slew of other areas – missing information checks for users, HTTP handlers, sample data for geographic areas, etc. You don’t have to go hunting down a master class to add these new things to, just create the class and you’re golden.
  • Security in our system isn’t that complex yet, so we’re able to consolidate everything in a nice tidy ActionFilter. It’s applied to our custom controller, and we have a unit test that makes sure all Controllers in the system inherit from that custom one. So by following the rules (on your own or with the help of a broken test), you get security handled for you auto-magically.
  • We continuously deploy with our build server, so it takes care of not only making sure all our unit/integration tests pass, but that all the needed debug settings are flipped, sites are pre-compiled, everything still works once it’s live, etc. That saves us from remembering to do all that every time we push live, which is almost constantly these days.
  • We completely agree with Chad Myers, Jeremy Miller, et al: if we’re working with a static language, make the best of it. Everything in our system is strongly typed, from text box ids in HTML/Javascript/UI tests to URLs and help points. You shouldn’t have to remember to go hunting and pecking through the whole system when you want to rename something, just rename it with ReSharper and move on. Same with finding where something is being referenced. The harder it is to rename things, the less they get renamed, and the crustier the system gets.
  • We started creating one off modal dialogs to present information to the user. They looked great, but needed a lot of baggage and duplication to do it, so we overrode the default alert and confirm dialogs with our modal ones. Now there’s not only nothing to add to your page to get this, but in most cases you don’t even have to remember we’re overriding it! There’s a forthcoming post that’ll cover what we did in more detail.
  • We have a unit test that’ll scan through all of our test files (which end with *Fixture), and make sure there’s a file name that matches (sans the Fixture part) in the corresponding directory structure in the main assembly. We constantly move files around when refactoring, and forgetting to move or rename their test files is a pain, so this test gently reminds us. Note we don’t always follow a one-class-per-fixture setup, but even when we don’t, we stick them in a matching fixture class for easy grouping and ReSharper discoverability.

It’s worth noting we didn’t set out from day one to build all this stuff. It’s all grown over time as the project and our stakeholders’ needs have changed. We always strive to keep KISS in mind (even if it is hard) and not build anything until it’s absolutely needed. Don’t try to create infrastructure to handle everything for you when a project’s in its infancy. Harvest it out later.

There are also exceptions to all of these rules. Is automatic security always the right thing to do? No – if you need highly configurable security, put it out in the open and remember to set it on each request. Don’t force things into the infrastructure if they’re fighting you and have lots of exception cases. Perhaps there’s another route of attack that’ll solve the problem and still keep you circling around the Pit of Success.

You’ll create your own Pit of Success on your project just by falling into the bigger Pit of Success that is the SOLID principles. The majority of the examples up there were arrived at just by adhering to the Open/Closed Principle or the Single Responsibility Principle. They create a sort of recursive pit, I suppose.

What have you done on your projects to help create a Pit of Success?
