- @alq @datadoghq
- DevOps <3 Customer Support
- Everyone has customers
- Devops is:
- Culture
- Automation
- Measurement
- Sharing
- engineering culture
- build and understand
- learn
- automation
- autoemails
- measurement
- quality
- new cases
- reopened cases
- engagement
- time to first response
- time to resolution
- first contact resolution %
- colume per channel
- sales
- impact of support on sales
- quality
- sharing
- Q&A
- how do you prevent your support team from optimizing metrics?
- pay is not indexed on metrics
- we haven’t really had the problem yet
- how do you prevent your support team from optimizing metrics?
- JRebel
- eliminates waiting times between coding and seeing results
- LiveRebel
- orchestration of release management
- manages deployment of war file
- “I don’t want to be woken up at night, so I call myself a developer”
- How did a devops mindset improve the software I write?
- In the beginning:
- agency job
- LAMP stack
- Dev/Ops split
- wasn’t even that bad…
- Ops knew what they were doing, didn’t want to be bothered by developers at all
- breaks down when a project becomes a little bit special
- scale (beyond one server)
- then: most efficient team was admin + programmer + cup of coffee
- example
- pull videos from various incoming feeds, aggregate, output
- ftp/rss/whatever
- one big program import/encode
- one program with architecture
- import threads
- in-memory queue
- encoding
- split into several processes
- importers
- persistent queue
- encoders
- dev perspective
- first is easiest to implement
- last one is hardest to implement
- ops perspective
- first is hardest to operate
- not much introspection (eg how full is the queue?)
- if one part fails, whole system fails
- last one is simplest to operate
- first is hardest to operate
- pull videos from various incoming feeds, aggregate, output
- code written for infrastructure
- follow conventions that the ops team will understand
- external quality > internal quality
- (interfaces!)
- My ugliest piece of code ran for 1.5 years in prod with no changes.
- nobody cared
- “Ease of setup” is a red flag
- prevalent in database world
- problematic later. complexity is hidden
- you might miss things along the way
- “ease of non-trivial configuration” - much more important
- set up a production-like system early on
- get ops involved at this point
- keep tally marks on how often it leaves you puzzled
- everything before this is just a hobby
- big bangs always happen in production
- expensive loadbalancers that die during configuration and need to be sent back to the manufacturers
- teams need to get used to their own systems
- metrics & logging are important
- but most dev teams underrate them and leave them late and poor
- should be there from day one
- “we’ll implement that later”
- internal tooling can save you a lot of work
- need creative ways to talk about it
- problems of strict roles
- platform refactorings
- get political very quickly
- lack of skill
- code reviews
- not enough staff that “is qualified” to review certain code
- platform refactorings
- devops mindset takes away friction
- Q&A
- how do you avoid needing everyone to know everything?
- breaking down role boundaries isn’t about enforcing everyth
- how do you avoid needing everyone to know everything?
- Project
- Activity
- action
- specific need
- time
- budget envelope
- it dept
- Product
- creative activity
- satisfies needs
- client
- employees are end users
- users = team + end users
- Learn->build->measure->learn
- Think Product. Do Project.
- much slower and riskier than can be estimated
- 1 project in 6 cost 3x more than expected
- large-scale compute spend 20x more likely to spiral out of control than expected (than what?)
- FoxMeyer Drugs’ bankruptcy after switching brutally to SAP
- Black swans (really?)
- bell curves and kurtosis
- Small iterations reduce risk
- if it’s going horribly wrong, cut your losses
- devops is natural conclusion of lean startup thinking
- typical junior dev knows nothing about ops
- learnt java at uni
- uses windows (for games)
- has installed linux for curiosity
- maintained a website for a uni club
- uploaded files using FileZilla/CuteFTP
- First skills are easy to teach a dev
- linux
- git (not poisoned by CVS)
- git branching
- scrum
- unit and functional tests
- Shit gets real with deployment and provisioning
- shell scripts :(
- capistrano - magic :(
- fabric - good compromise?
- every project has a deploy.py script in a
devops
folder - everyone can now deploy
- juniors seem to check if there is an experienced guy around
before deploying…
- do they not trust the rollback system?
- Asked sysadmin-knowledgable devs: how did you learn?
- “I imrpoved a lot when I started renting my own server”
- To acquire experience, you need sandboxes for devs
- new instance per project
- easy to reset
- IaaS!
- we have a homemade IaaS platform as the testing server
- multiple opportunities for a dev to play with a clean linux environment
- scripted provisioning must stay simple!
- “it’s become so complicated since my CuteFTP days, that even the senior devops-type folk don’t know what to use”
- 80% of needs: install packages, modify config files
- typically on one server
- not the use-case of chef server/puppetmaster: designed for clusters
- chef-solo: a “who writes the most magical Ruby” contest
- puppet/chef modules: do they actually work out of the box?
- too much abstraction anyway (what?)
- fabtools: easy and dev-friendly! not very standard
- vagrant!
- awesome!
- tiny issues on the file-sharing
- deep cultural differences
- DHH vs Stallman
- “Touching a server is risky”
- sysadmins should always pair with devs
- scrum-compatible? yes (apparently)
- PS: hire devs who are eager to learn!
- Some issues are constraints for developing fast
- TDD (in the view of a junior dev)
- Performance
- Provisioning
- Scaling
- Backups
- none of these matter in a dev environment
- Use visual management as a natural incentive
- Make performance constraints part of DONE definition
- if you expect 500ms page load times, but you don’t write it down, how else do you make it happen?
- Include performance solutions in the standard provisioning
- if you’re using varnish in production, and want devs to use correct caching headers, need varnish in development
- do visible perfomance graphs
- Make performance constraints part of DONE definition
- Make scaling cool
- The power of NoSQL: devs have to think the model in a scalable way
- dev doesn’t have option of making this amazing 50-line SQL query
- scalable by default
- IaaS is infrastructure with an API…awesome!
- The power of NoSQL: devs have to think the model in a scalable way
- How do you get devs to feel responsible for production?
- why do ops feel that responsible, and not devs?
- dev: “it’s not my job”
- (is it the “devops” job?)
- What about backups and server monitoring?
- “I push the backup button in the GUI of my VPS”
- “I check the response time in Pingdom after every big event. If it is too high I know it is time to do something.”
- For devs turned into devops, SaaS is the solution:
- newrelic
- pingdom
- Idera ServerBackup
- And for bigger needs? personally I turn to real ops
- what’s the right ratio for dev/ops in a scrum team?
- in my example, ops was an outside expert, not part of the team
- the solution was: when an outside expert is needed, make him pair
- Infrastructure improvements, support, firefighting
- Devops spends 33% more time improving infrastructure than traditional ops
- Traditional IT Ops require nearly 60% more time supporting
- Devops spends 21% less time putting out fires
- yay!
- Devops spend more time on self-improvement
- devops recover from failures faster
- devops need less than half the time to release an application (36 min vs 86 min)
- devops spends more time improving things and less time fixing things
- top tools
- sh
- se
- vim
- nagios
- puppet
- python
- chef
- config tools
- puppet 40%
- chef 31%
- bash
- cfengine
- ansible
- fabric
- test automation
- se
- junit
- custom
- jmeter
- jenkins
- soapui
- rspec
- monitoring
- nagios
- custom
- newrelic
- zabbix
- graphite
- pingdom
- munin
- devops wins! but still isn’t perfect
- failures due to: software quality or lack of automation
- report is available to everyone for free
- premature optimization is the root of all evil
- stop guessing
- eg using for (;;) loops instead of for (o : os) loops
- the last mile problem: performance
- architecture
- development
- performance test
- go live!
- except performance sucks
- massive project delays
- only realized at the last minute
- there is a better way
- I have lived there
- when I was young, I was managing things all the way from dev to prod
- what happened? the Original Sin
- overcommitment
- too many bugs
- and God said: structured waterfall
- tower of babel
- specialization to win
- but specialists developed different languages
- we can’t go back, because the complexity is here
- we need the specialists
- flood of projects
- you have to be moses and open the red sea
Karanbir Singh http://www.karan.org
- Project Raindrops
- kickstart file
- config file for type of hypervisor etc
- create a job which defines config + kickstart, get an image
- image generation aaS - awesome
- user pickup by default
- raindrops can set up aws image for you
- builds happen on real hypervisors
- lessons from first few days:
- bitcoin miners
- spambots
- proxies from china & iran
- http://projectraindrops.net/
- A cure for everything!
- nobody knows the secret formula
- celebrate delivery
- communication
- automation
- measuring
- sharing
- test driven
- devops/noops/infracoders
- the end users love it
- solve your own bottleneck, adapt it to your needs.
- How do you handle refactoring which spans more than one app, more than one team, more than one API boundary?
- eg: multiple interconnected webapps, each maintained by a different team.
- Step 1: simplify. Kill features! Features carry a cost, and if they are harming performance, they may not be carrying their weight
- API versioning
- Branch by abstraction for APIs:
- start with frontend hitting API-v1
- introduce API-v2
- gradually migrate frontend from using API-v1 to API-v2, one call at a time
- check it all works
- remove API-v1
- semantic versioning – communicate when your APIs break
- only really viable if you have confidence that your test suite will catch breaking changes
- API versioning can be problematic for legacy reasons. If you’re
close to your API consumers, it’s easy to drop old API
versions. If you’re far removed (say, with a public API), you
may have to keep maintaining multiple API versions for a long time
- though one way to alleviate this can be to reimplement API v1 as a shim on API v2; that way, you have much less code to maintain and (more importantly) less duplication
- Branch by abstraction for APIs:
- Team interplay
- If I’m a frontend dev and I want to instigate a change to a
backend API, how should I go about it?
- fork the repo and JFDI!
- but get it reviewed by the backend team
- go over and pair with someone from the backend team for a while
- good communication is key – your JavaScript developer may not have the ability to just fix it themselves; but they should feel able to approach backend team for help
- fork the repo and JFDI!
- How do application developers get feedback about production
performance?
- Who deploys the code?
- if devs are deploying their own code, they should also be watching the monitoring as they deploy it
- this builds up familiarity with the metrics relevant to their app
- then the devs know if their app is worsening in performance
- issue: there may be 20,000 graphs. The ops will have much
more knowledge than the devs in how to wade through all this
data, but the devs should be more directly interested in the
data specific to their app.
- mitigation: pair an ops and a dev on making a dashboard
- devs & QAs to own the dashboard
- ideas for metrics:
- response time
- SQL query time
- disk, cpu, memory
- iowait
- Who deploys the code?
- If I’m a frontend dev and I want to instigate a change to a
backend API, how should I go about it?
- Kanban wall:
- columns
- limited WIP
- optimize for throughput
- if you hit a WIP limit, stop the line and fix the blockage
- sharing information internally
- wiki
- example: developer configuring logstash, simultaneously writes wiki page for logstash
- “we’re an ops team but we have CI specialists and production
specialists. would we need one kanban board or two?”
- kanban works for cross-functional teams
- anyone can pick up any story
- if you violate this, your velocity may become misleading
- eg CI folks consistently do a 1-point story 3x faster than production folks; velocity will become skewed by prod folks
- kanban works for cross-functional teams
- how do you handle interrupt-driven work within kanban?
- ie if you’re at a WIP limit, how do you handle an urgent production outage?
- make kanban work for you; don’t become a slave to it
- if WIP limit is preventing you from working, increase it
- but what’s the point of a WIP limit if you just raise it
when you hit it? Isn’t it trying to tell you something?
- if you hit a WIP limit, yes you should investigate
why. Correct action will depend
- if someone’s struggling with their story, stop the line & help them?
- too many outstanding pull requests and not enough people reviewing - JFDI
- team is working well, but some people have nothing to do – raise WIP limit
- if you hit a WIP limit, yes you should investigate
why. Correct action will depend
- but what’s the point of a WIP limit if you just raise it
when you hit it? Isn’t it trying to tell you something?
- if WIP limit is preventing you from working, increase it
- can have an “urgent” lane for interrupts
- drop an existing task to a “blocked” column to make room for the urgent task
- how does this happen in scrum vs kanban?
- in scrum, if I have a production outage which my most senior dev takes on, I can look to reduce our commitment for the sprint in order to communicate the loss of capacity to the stakeholders. How does kanban cope?
- iterations are an accounting tool. you measure velocity and use yesterday’s weather to do planning. If you think you’ve taken a hit, you can communicate an expectation of a reduced velocity for the iteration.
- maven project in eclipse
- tomcat (with jrebel) running
- edit JSPs & properties file & save
- instantly see results in tomcat
- edit java validation logic & save
- ditto
- HotSpot can do some of these things, but JRebel can do advanced things such as big refactorings, adding classes, deleting classes, etc
- orchestration
- releases should be testable, reversible, automated, consistent
- command centre
- build artefact for an app (code+db+conf)
- one-stop mgmt console
- see what’s currently deployed to any given server
- app view:
- see what versions of what apps
- drill down into artefacts to see what’s in place
- diff between versions
- Software developer, GDS
- embedded inside Cabinet office, inside HMG
- civil service - 464000, google 55000, BBC 20000
- GDS started 18 months ago
- developers
- designers
- writers
- policy
- comms
- operations
- Tiny govt dept + web agency + creative agency
- “Digital services so good that people prefer to use them”
- rather than paper, phone, etc
- GDS own the UX end-to-end
- focus on user need
- not government need
- ship fast
- measure everything
- focus on user need
- Martha Lane Fox’s report “Revolution, not evolution”
- Build a centre of excellence
- this is GDS
- Fix publishing
- Fix transactions
- Go wholesale
- Build a centre of excellence
- Step #1: GOV.UK
- Martha’s Report arrives
- Cabinet Office: “show us”
- shipped in 12 weeks
- design principles
- start with needs
- open source
- provide back to community
- https://github.com/alphagov
- releases: cap deploy
- about 4 developers
- deploy straight from local machines to single server in AWS
- lovingly handcrafted
- cap deploy couldn’t scale
- jenkins to provide deployment queue
- and provide repeatability
- commit to master
- kick off jenkins build
- tests pass (or fail, and people become noisy)
- create tagged release
- big visible build dashboards
- and provide repeatability
- acceptance criteria for smoke tests
- cucumber tests
- alphagov/smokey
- eg
- Given EFG has booted
- and I am benchmarking
- when I visit
- the elapsed time should be less than 1 second
- also used in nagios
- cucumber tests
- configuration management
- puppet/chef/pallet
- we used puppet
- because we knew it
- main selling point: faster than other suppliers
- met dept
- “this is what we’d like. Can you work out if it’s possible? let’s meet again in a month”
- built, shipped in 3 days
- break things
- devs had sudo access
- met dept
- user needs: when do the clocks change?
- go wholesale: public api
- make it easier to do the hard things
- exposing the release process to the designers
- allow design team to deploy static assets straight to prod
- fun for them, scary for us
- simplify with abstractions
- how many people know a scripting language?
- how many people quantum mechanics?
- scripting languages
- VM
- machine code
- transistors
- semiconductors
- quantum mechanics
- (c) @nickstenning
- quantum mechanics
- semiconductors
- transistors
- machine code
- VM
- puppet abstractions for deploying apps
govuk::apps::search
- rack application
- could be JVM, perl, etc
- health check url
- sets up:
- logging
- alerts
- load balancing (through healthcheck)
- contains a lot of the magic
- Okay, shit gets real now
- Removing control
- can’t give everyone sudo anymore
- 2nd line operations
- two ops, two devs
- enforced rotation each week
- gatekeepers to production
- release calendar
- release application
- show what’s currently in staging or prod
- @psd: “shipped a nation’s website. No biggie”
- midnight release
- flipped over DNS
- redirected ALL THE URLS
- old bookmarks continued to work
- Design of the year award 2013!
- International update Rails day
- 8th Jan this year
- 2030: exploit comes out (CVE hits mailing lists)
- 2035: hits internal mailing list
- team formed
- patched & deployed within 90 minutes
- only devs, no ops
- noone was on call; people just cared
- Government is big
- Focus on user need
- ship fast
- measure everything
- speed is important, but momentum is everything
- https://www.gov.uk/service-manual
- devops
- multidisciplinary teams working together
- solving user needs
- not marketing
- not consulting
- not a “devops team”
- you can’t be “an agile”; you can’t be “a devops”
- devops
- Did you get any pushback or internal politics?
- the response was “this can’t be done” “government is different” “you don’t understand our use-case”
- we said: “we’ll iterate”
- instead of talking about it in lengthy meetings, we shipped stuff
- if you ship stuff and meet needs, nobody will question you
- How did you gather user needs?
- Going through server logs
- search hits (internal search/google)
- where people can’t find what they want
- search hits (internal search/google)
- user research teams
- particularly when closing websites
- eg: travel advice
- provided reams of information
- what people really needed:
- they’re going travelling
- they just want a page to print out
- Going through server logs
- Do you have any metrics on how long it takes from code change to
prod deploy, and how many times you deploy each week/month/year?
- 15-20 deploys a day
- fast time-to-prod
- designing 10%, building 20%, corrections 70%
- Silos: Dev vs Ops
- “Agile”: Building then corrections (ie ditch design phase)
- Goal: lower defect rates
- Visibility
- meaningful data (relating to business value)
- state data (structured payload)
- heterogeneous (not just machine-level. everyone is involved)
- Alfred Korzybski - “The map is not the territory”
- we want:
- Better lifecycle
- informed decisions
- less maps
- more territory (or better maps)
- systems are (increasingly) complex
- back in 2000
- tech: apache/php/mysql
- visibility: cpu/memory/bandwidth
- now
- tech: haproxy/nginx/cassandra/redis/mysql/kitchen sink
- split over 27 nodes
- visibility: cpu/memory/bandwidth
- tech: haproxy/nginx/cassandra/redis/mysql/kitchen sink
- “how is my business doing?”
- “oh, about 6 Gbps”
- WAT
- what are the real key metrics?
- the ones that link back to the business?
- “oh, about 6 Gbps”
- metrics & events correlation - across:
- system
- components
- software (stuff we built & stuff we didn’t)
- lots of small producers, few big consumers
- STREAM ALL THE THINGS
- anything that happens or moves
- consumption:
- aggregate
- ratios/sums/averages/max
- correlate
- with eg deploy events, puppet runs etc
- decide
- track, alert, ignore, scale
- aggregate
- implementing this
- on premises, SaaS, in between
- no magic bullet
- on premises
- tech
- collectd
- logstash
- graphite
- statsd
- riemann (more complex, lives in statsd space)
- much more effort than SaaS
- tech
- the path to visibility:
- find key metrics
- find right tools
- event stream
- involve everyone
- challenge your mental model
- goal: improve quality, lower defect rates
- how do you identify key metrics?
- top-down
- if you’re a company, what are you selling?
- Thinking it’s only tooling
- also needs processes and culture
- culture is key
- Start with the wrong tooling strategy
- build your own solution
- fits requirements
- requires integration work
- deployment tool
- appliance images
- PaaS
- build your own solution
- Think only server deployment
- Configuration management
- orchestration
- Command FW
- puppet
- chef
- salt
- ansible
- cap
- fabric
- rundeck
- control tier
- cfengine
- func
- mcollective
- Trying to fully automate data and database upgrades
- Focus on rebranding as “DevOps”
- Don’t
- rename teams
- set DevOps as title or function
- create a new team
- Do
- forget the name
- empower your teams to foster the transition
- Don’t
- If you don’t measure it, you can’t value it
- define goals
- find metrics
- expose KPIs
- Forget CI loops
- devops is an endless journey
- set up a CI loop: tools and processes
- backlog -> improve -> feedback -> formalize and categorize
- Focus only on deployment process
- There are loads more areas where dev/ops collaboration makes sense
- #monitoringsucks
- common dashboards using ops, dev and biz metrics
- stop servercentric
- log management
- monitoring == test (cucumber-nagios)
- Performance management
- autoscaling
- scalability
- Incident/problem management
- bring dev into room when incidents are popping
- BCP/DRP design (what?)
- security
- conflict with ITIL
- Some prod environments do require strong service management
- change management
- puppet/chef + vcs
- release management
- deployment tools
- capacity management
- service continuity management
- notes to follow..
- cloud:
- the promise: automatic everything
- the reality: long, detailed, manual descriptions of the whole system
- Aeolus
- creating architecture through contraint solving
- O_O
- create toolbox
- appservers
- load balancers
- databases
- each thing requires
- creating architecture through contraint solving
- Zephyrus
- problems of undecidability
- http://www.aeolusproject.org/
This session was really interesting. It was run as a workshop by Pat. He divided us into groups of 4-5, and asked us to focus on the infrastructure of one of the group members. We mapped their infrastructure out on the wall using sticky notes in four steps:
First, we mapped out all of the infrastructure we had in our production environment, starting from how an end-user would enter our system (by convention, from the top of the diagram). Equipment such as firewalls, load balancers, application servers, databases of various kinds, message queues, mail servers, monitoring servers, puppet masters were all listed, along with their dependencies and communication channels.
Other things which were on this diagram were external dependencies such as mail-sending services (like Amazon SES), third party monitoring providers, and external APIs we connect to.
We were also encouraged to identify three key business processes, and mark those systems which are critical to those processes.
The second step was to map what the process was for a user to resolve a problem they were having. Who do they speak to, and how do things get escalated to team members?
This included roles such as helpdesk, sysadmin, DBA, developer, third-party suppliers (eg cloud provider support), third-party specialists (eg Oracle or Cisco specialists).
The third step was to map the infrastructure required to get code into production. Build servers, test environments, artefact repositories, that sort of thing.
The fourth and final step was to map the process and people involved in a code change by a developer is allowed to go through to production. BAs, QAs, product owners, change managers, sysadmins are all roles which might appear at this step.
The four diagrams were put on the wall in four quadrants of an overall diagram:
Deployment infrastructure | Production infrastructure |
---|---|
Development process | User issue resolution process |
There’s a nice pattern that has emerged here; you can see it like this:
Development infrastructure | Production infrastructure |
---|---|
Development process | Production process |
This exercise was really interesting, though probably too involved for a half-hour open space session as nobody got beyond discussing the right-hand two quadrants! I’d like to try this again sometime.
Oliver White presented a report from Rebel Labs ( http://zeroturnaround.com/rebellabs/ ) on a survey of various traditional ops and devops teams (available here, signup required: http://zeroturnaround.com/rebellabs/ops/it-ops-devops-productivity-report-2013/ )
It was divided into four parts:
- work week (ie what proportion of your week do you spend doing X?)
- failures and recoveries
- tools
- software releases
It was a really good session, though it probably isn’t fair to try to summarize it here as the real meat was in all the detail which I didn’t manage to capture.