Let’s set the stage. My developer, Anton, has a Jira ticket assigned to him for adding a feature to our product VyScale. How does he go about doing it? The process of taking a Jira feature request to production is what we call developer flow. This is a complex process with cycles: a typical flow consists of several checkpoints that the code has to pass before it gets into production and can be called done.
One possible flow is as follows:
O: Developer changes code
A: Local manual testing using stdout, logs, etc.
B: Post-merge master branch testing
C: Testing by the ops guys
D: Monitoring in production
Now, each one of these checkpoints acts as a gating function. If the code passes, it moves to the next checkpoint. If it fails, it is sent back to checkpoint O (the developer).
Thus, one possible flow is:
Seq 1: O A O A O A O A O A ... O A // developer tests locally until he/she commits
Seq 2: B O A B O A B O A B // developer loops back until B is passed
Seq 3: C O A B C O A B C // code is put in production once C is passed
Seq 4: D O A B C // code is sent back to the developer if it fails in production
Seq 5: D // code is stable in production
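The loopback behavior above can be sketched as a small simulation. The failure probabilities below are made-up illustrative numbers, not figures from this article:

```python
import random

# Checkpoints in order; a failure at any checkpoint sends the code
# back to O (the developer), after which A..N run again.
CHECKPOINTS = "ABCD"
P_FAIL = {"A": 0.5, "B": 0.3, "C": 0.2, "D": 0.1}  # illustrative only

def run_flow(rng):
    """Return the full trace of one change, e.g. 'O A B O A B C D'."""
    trace = ["O"]            # developer writes the initial change
    i = 0                    # index of the checkpoint being attempted
    while i < len(CHECKPOINTS):
        cp = CHECKPOINTS[i]
        trace.append(cp)
        if rng.random() < P_FAIL[cp]:
            trace.append("O")  # failure: back to the developer...
            i = 0              # ...and re-run from A
        else:
            i += 1             # pass: move to the next checkpoint
    return " ".join(trace)

print(run_flow(random.Random(0)))
```

Each run produces one of the sequences shown above; a lucky run is simply `O A B C D`.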
Now, each sequence has a cost associated with it. In fact, higher-quality development is development in which Seq 1 is the longest of all the sequences while the other sequences are short or non-existent. Why is that? Because the cost of failing at A is very low. The cost of failing at B is higher because:
The developer has moved on to something else and will have to “context switch” back. This context switch is the biggest cost of slow development.
The resources spent performing B have been wasted.
Failing at C is even more expensive, as the cost of B has already been paid. Plus, the developer has moved on for even longer, so the cost of context switching back is even higher. The context-switch cost grows with the time between when the developer wrote the code and when the feedback is received.
Failing at D has even more at stake because the bug could impact the entire company’s public image.
Therefore, a few things can be established:
Every checkpoint N has:
C<N>: The cost of performing the check (developer cost not included)
T<N>: The time taken to perform the check
PF<N>: The probability of failure
CL<N>: The cost of a failure (aka, a loopback)
CL<A> = C<A> + k(T<A>)
CL<B> = C<A> + C<B> + k(T<A> + T<B>)
CL<C> = C<A> + C<B> + C<C> + k(T<A> + T<B> + T<C>)
CL<D> = C<A> + C<B> + C<C> + C<D> + k(T<A> + T<B> + T<C> + T<D>)
Total cost = C<A> + C<B> + C<C> + C<D> + CL<A> * PF<A> + CL<B> * PF<B> + CL<C> * PF<C> + CL<D> * PF<D>
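In code, with hypothetical numbers for C, T, and PF (k is the constant that converts elapsed waiting time into context-switch cost):

```python
# Illustrative inputs; none of these figures come from the article.
C  = {"A": 0.1, "B": 5.0, "C": 20.0, "D": 100.0}  # cost of performing each check
T  = {"A": 0.01, "B": 1.0, "C": 8.0, "D": 24.0}   # time per check, in hours
PF = {"A": 0.5, "B": 0.2, "C": 0.1, "D": 0.02}    # probability of failure
k  = 2.0                                          # context-switch cost per hour
ORDER = "ABCD"

def loopback_cost(n):
    """CL<N>: everything spent up to and including checkpoint N,
    plus the context-switch penalty for the elapsed time."""
    upto = ORDER[: ORDER.index(n) + 1]
    return sum(C[x] for x in upto) + k * sum(T[x] for x in upto)

# Total cost = cost of running every check once, plus the expected
# loopback cost at each checkpoint.
total = sum(C.values()) + sum(loopback_cost(n) * PF[n] for n in ORDER)
```

Note how CL grows monotonically down the pipeline: every later checkpoint pays for all the earlier ones again.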
Now, let’s try to optimize this flow. There are many different ways to handle this.
We can see that it is key to reduce T<N> at every checkpoint that becomes a bottleneck.
“A” is re-executed on every failure at A, B, C, or D. Thus, it is key to keep T<A> very low. This implies proactively investing in developer productivity: providing a stable development environment, providing scripts for common tasks, providing training and documentation where necessary, and removing unnecessary build steps from the code flow.
For example, suppose implementing a particular feature the quick way saves a man-month of development time but adds 10 seconds to the build; it is not worth it. If the build time increases from 100s to 110s, your developers are now roughly 1% less productive (at, say, 30 builds a day, those extra 10 seconds add up to five minutes). In a 50-developer team, the cost of a 1% reduction in productivity (five extra minutes per day) is about 1,000 man-hours a year, which is equivalent to six man-months. As a result, it is worth hiring a new employee or contractor to implement the feature the right way.
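The arithmetic in that example, spelled out. The 240 workdays per year and 168 working hours per man-month are assumptions, not figures from the article:

```python
developers = 50
minutes_lost_per_day = 5     # ~1% of an 8-hour day
workdays_per_year = 240      # assumed
hours_per_month = 168        # assumed working hours in a man-month

hours_lost = developers * minutes_lost_per_day * workdays_per_year / 60
man_months = hours_lost / hours_per_month

print(hours_lost)   # 1000.0 man-hours per year
print(man_months)   # ~6 man-months
```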
Minimizing T<A> means that a developer should be able to run (or compile and run) the newly edited code in under two seconds. It is worth investing the effort to optimize this flow to the Nth degree, e.g., streaming logs to the developer as they are written.
When looking at T<B>, we should try to catch as many bugs at B as possible while keeping T<B> under control. It’s tempting to make B lean and fast to reduce CL<B>, but that is a mistake if keeping B lean leads to more failures at C or D. If it seems difficult to make B both exhaustive and fast, it’s beneficial to split B into two checkpoints, B1 and B2: B1 is the fast checkpoint, which returns feedback quickly, and B2 delivers the rest of the feedback a little later. It’s okay to split B into even more steps, if need be, to achieve this.
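One way to sketch the B1/B2 split: run both stages in parallel, but report B1 the moment it finishes instead of waiting for the whole suite. The check functions here are hypothetical placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def b1_fast_checks():
    """B1: lint + unit tests -- seconds, catches most loopbacks early."""
    return True  # placeholder result

def b2_slow_checks():
    """B2: integration + load tests -- minutes, feedback arrives later."""
    return True  # placeholder result

def run_b(notify):
    """Run both stages concurrently; notify the developer per stage."""
    with ThreadPoolExecutor() as pool:
        fast = pool.submit(b1_fast_checks)
        slow = pool.submit(b2_slow_checks)
        notify("B1", fast.result())  # fast feedback first
        notify("B2", slow.result())  # exhaustive feedback a little later
```

The design choice is that the developer is unblocked (or looped back) by B1 while B2 is still running.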
There are a few things that can help here. For example, throw machines at the problem and invest in distributed QA solutions that allow QA jobs to run in parallel as much as possible. Also, there is no need to wait for all the results to be gathered before giving feedback to the developer. It is important that QA jobs for independent tasks never sit waiting in a queue, which strongly justifies investing in a QA system that auto-scales. After all, every hour your QA jobs sit in a queue can easily cost more than 100x in developer time what the machine time would have cost.
A common solution is to force discipline onto the developers and request that they test their code more thoroughly before submitting it to the queue system. This is also a mistake. Essentially, it is the equivalent of making A heavier (increasing T<A>) in order to reduce PF<B>. The issue is that once A is less lean, it slows the developer down on every single loopback through A. It is better to build a system that actually prefers to waste machine time rather than human time.
On this note, I have heard ops guys say that developers should be developing in remote environments in the cloud. This is often also a bad idea, because by forcing developers to work remotely you lose all the time they would have worked while the Internet connection is slow or weak. We all know that our laptops stay up more than our WiFi connections. Plus, adding network latency to the O and A steps can be extremely frustrating. If your developers cannot use their favorite editors or shortcuts, and they have to struggle with terminal multiplexers like screen, you have lost the entire argument. If opening another terminal takes your developers an extra 30 seconds, so that they keep multi-tasking in the same terminal, you have lost the argument again.
Think of it this way: for a 50-developer team, a loss of only one minute per day leads to an overall loss of roughly $21k a year. Now, if you slow them down by 10%, you have just lost $1M/yr. Is it really worth it?
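Those dollar figures can be reconstructed with an assumed fully loaded cost of $100/hour and 240 workdays a year; the results land in the same ballpark as the figures above:

```python
developers = 50
rate_per_hour = 100  # assumed fully loaded developer cost, $/hour
workdays = 240       # assumed workdays per year

def yearly_cost(minutes_lost_per_day):
    hours = developers * minutes_lost_per_day * workdays / 60
    return hours * rate_per_hour

one_minute = yearly_cost(1)    # one lost minute per day per developer
ten_percent = yearly_cost(48)  # 10% of an 8-hour day = 48 minutes

print(one_minute)    # $20,000/yr -- roughly the $21k above
print(ten_percent)   # $960,000/yr -- roughly the $1M above
```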
If anything, one should invest in hiring developer teams just to build tools that make your developers productive. Saving only 10 minutes of each developer’s time every day pays for one full-time support engineer. Don’t try to enforce discipline on the O-A flow. Let the developers do what’s most productive for them, but make checkpoint B the gating function that enforces discipline. However, keep B quick, and make it exhaustive by throwing machines at it.
How do you optimize C? C is tricky because we now have a human element: the overhead of the ops guys. Ops guys are expensive and often the bottleneck in any organization. If anything, wasting their time is even worse than wasting a developer’s time.
This is where modern technology comes in handy. Moving as much of C as possible into B is the key. The last test in B should spin up a full production environment and do full testing, including load testing, so that the deliverable to C is a production-ready system. Developers are not solution architects, so the systems they build can only be ad hoc; but it is possible to use constructs that port well. This is where shipping containers, or VMs instead of git repos, can help significantly. It is also better to treat machines as black boxes and keep ops a layer above.
By setting up these workflows, management receives immediate business benefits. The developers are more productive, because they are no longer doing context switches. The developers are happy, because they get to focus on what they love instead of dealing with distractions. The bring-up process for new hires is greatly simplified. And, most importantly, the quality of the code improves, because testing is a part of the process from day one.
Looking for more information? Download our paper: Six Ways Web Based Businesses Can Use DevOps to Improve Web Development Workflow