Kubernetes in Production: Services

Nikolay Sivko
Published in okmeter.io blog · Dec 4, 2018

We migrated all of our services to Kubernetes about six months ago. At first glance, the task seemed quite simple: deploy a cluster, write application specifications, and that’s it. But since we’re obsessed with stability, we had to learn how k8s behaves under pressure, so we tested multiple failure scenarios. Most of the questions that arose were network-related. One particular point of concern was how Kubernetes Services function.

Going by the documentation, the plan looked simple:

  • Roll out an application Deployment
  • Configure liveness and readiness probes
  • Create a Service
  • Conduct stress tests: workload balancing, failure handling, etc.

However, when it comes to real-life scenarios, everything gets a bit more complicated. Let’s take a closer look.

A quick intro to k8s

Assuming that whoever reads this is already familiar with the basics of Kubernetes and its terminology, let’s quickly cover what’s involved with k8s Services.

A Service is a k8s entity which defines a set of Pods and methods to access them.

Let’s launch an application:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  selector:
    matchLabels:
      app: webapp
  replicas: 2
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: defaultxz/webapp
        command: ["/webapp", "0.0.0.0:80"]
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet: {path: /, port: 80}
          initialDelaySeconds: 1
          periodSeconds: 1
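
Assuming the manifest is saved as webapp-deployment.yaml (the file name here is just an example), we apply it to the cluster:

$ kubectl apply -f webapp-deployment.yaml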

After applying the manifest, we check that the Pods are up and running:

$ kubectl get pods -l app=webapp
NAME                      READY     STATUS    RESTARTS   AGE
webapp-5d5d96f786-b2jxb   1/1       Running   0          3h
webapp-5d5d96f786-rt6j7   1/1       Running   0          3h

Two Pods are running because we set ‘replicas: 2’. To access them, we create a Service that selects the target Pods and specifies the ports:

kind: Service
apiVersion: v1
metadata:
  name: webapp
spec:
  selector:
    app: webapp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80

$ kubectl get svc webapp
NAME     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
webapp   ClusterIP   10.97.149.77   <none>        80/TCP    1d

Now you can access this HTTP app from any machine in the cluster:

$ curl -i http://10.97.149.77 
HTTP/1.1 200 OK
Date: Mon, 24 Sep 2018 11:55:14 UTC
Content-Length: 2
Content-Type: text/plain; charset=utf-8

How it works

Here’s what’s going on, in simple terms:

  • A user runs a kubectl apply for the Deployment specification.
  • Some k8s magic happens; it’s not important what’s behind the curtains at this stage.
  • As a result of this magic, functioning application Pods will appear on some nodes.
  • Every so often, a kubelet (a k8s node agent) checks the liveness and readiness of all the Pods launched on its node.
  • After the probes are done, the kubelet sends the results to the k8s apiserver (interface to the k8s ‘brain’).
  • The kube-proxy on each node is notified by the apiserver about every change to the Service and its Endpoints (the set of ready Pod addresses).
  • The kube-proxy pushes all the changes to the underlying proxying subsystem (iptables, IPVS, etc.) configuration.
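
You can see the Pod addresses that the control plane tracks for a Service by listing its Endpoints (the values here match the Pod IPs used below):

$ kubectl get endpoints webapp
NAME     ENDPOINTS                       AGE
webapp   10.244.0.10:80,10.244.0.11:80   1d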

By default, kube-proxy uses iptables. Here is an example rule it generates for our Service:

-A KUBE-SERVICES -d 10.97.149.77/32 -p tcp -m comment --comment "default/webapp: cluster IP" -m tcp --dport 80 -j KUBE-SVC-BL7FHTIPVYJBLWZN

All traffic to 10.97.149.77 jumps to the KUBE-SVC-BL7FHTIPVYJBLWZN chain, where it’s split between two further chains, one per Pod:

-A KUBE-SVC-BL7FHTIPVYJBLWZN -m comment --comment "default/webapp:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-UPKHDYQWGW4MVMBS
-A KUBE-SVC-BL7FHTIPVYJBLWZN -m comment --comment "default/webapp:" -j KUBE-SEP-FFCBJRUPEN3YPZQT

From there, it gets routed to the final destination — our Pods (at the IPs 10.244.0.10 and 10.244.0.11):

-A KUBE-SEP-UPKHDYQWGW4MVMBS -p tcp -m comment --comment "default/webapp:" -m tcp -j DNAT --to-destination 10.244.0.10:80
-A KUBE-SEP-FFCBJRUPEN3YPZQT -p tcp -m comment --comment "default/webapp:" -m tcp -j DNAT --to-destination 10.244.0.11:80
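
If you want to inspect these chains on a node yourself, you can dump the NAT table and filter by the Service name (the grep pattern is just an example):

$ sudo iptables-save -t nat | grep webapp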

Testing a Pod failure

For testing, I made a webapp that can be switched into an “error rush” mode, in which it starts generating errors; you turn it on by making a request to “/err”.

Here are the results we got for the command ‘ab -c 50 -n 20000’ while turning on the “/err” mode on one Pod in the middle of a test:

Complete requests:      20000
Failed requests:        3719

The important thing here is not the number of errors (which is proportional to the workload), but the fact that any occurred at all. At some point, the “bad” or “erroneous” Pod was excluded from balancing, but the client (‘ab’ in this case) still received a lot of errors. The reason is fairly easy to explain: the kubelet checks the readiness every second (configured in readinessProbe.periodSeconds), and it takes time to notify all kube-proxies about a certain Pod failure in order to exclude it from traffic balancing.
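Tightening the probe settings narrows this window, but it cannot close it: an unready Pod still has to be noticed by the kubelet, and the change still has to propagate through the apiserver to every kube-proxy. A sketch with illustrative values:

readinessProbe:
  httpGet: {path: /, port: 80}
  periodSeconds: 1      # probe every second
  timeoutSeconds: 1     # treat a slow answer as a failure
  failureThreshold: 1   # mark the Pod unready after a single failed probe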

Is the IPVS kube-proxy backend (experimental) any better?

Not really! It optimizes proxying and adds custom balancing algorithms, but it doesn’t speed up Pod failure notifications.
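For completeness, switching kube-proxy to IPVS is just a configuration change; in a kubeadm cluster it lives in the kube-proxy ConfigMap. A sketch of the relevant part (the scheduler choice is illustrative):

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"   # round-robin; other IPVS schedulers can be selected here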

What can be done?

This problem can only be solved by a balancer with a request retry mechanism. In other words, we need an OSI Layer 7 (L7) load balancer that understands HTTP traffic. Such balancers are already used in Kubernetes in the form of ingress controllers. Originally they were designed to be cluster entry points, but it turns out they are perfectly suited to L7 balancing inside the cluster as well. Another option is a service mesh, for example istio.
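
To illustrate, a service mesh such as istio lets you declare per-service retries roughly like this (a sketch against istio’s VirtualService API; the host name reuses our webapp example, and the retry values are illustrative):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: webapp
spec:
  hosts:
  - webapp
  http:
  - route:
    - destination:
        host: webapp
    retries:
      attempts: 2                        # retry a failed request up to two times
      perTryTimeout: 1s
      retryOn: "5xx,connect-failure"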

For our production setup, we adopted neither an ingress controller nor a service mesh because of the added complexity. In my opinion, such abstractions pay off when you frequently have to reconfigure a large number of services. At the same time, you sacrifice simplicity of control and of the infrastructure itself, and you have to spend extra time figuring out how to properly configure retries and timeouts for each and every service.

Our approach

We ended up using so-called “headless Services.” A headless Service has no virtual cluster IP, so kube-proxy and iptables stay out of the traffic path entirely. What it provides is simply a way to group Pods and to obtain the list of live Pods, either through DNS or through the k8s API.
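
A headless Service differs from the earlier one only by ‘clusterIP: None’; a DNS lookup of its name then returns the addresses of the ready Pods directly (a sketch based on the webapp Service above):

kind: Service
apiVersion: v1
metadata:
  name: webapp
spec:
  clusterIP: None        # headless: no virtual IP, no kube-proxy balancing
  selector:
    app: webapp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80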

We do this by adding a “sidecar container” with Envoy (an L4/L7 proxy) to every application that talks to other services. Envoy periodically re-resolves the Service name via DNS to get an updated list of Pods and, most importantly, it can retry requests that failed because of a Pod error.
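
Here is a trimmed sketch of the idea rather than our exact production config; field names follow the Envoy v3 API, and the listener port, timeouts, and retry settings are illustrative:

static_resources:
  listeners:
  - name: webapp_out
    address:
      socket_address: {address: 127.0.0.1, port_value: 9001}
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: webapp_out
          route_config:
            virtual_hosts:
            - name: webapp
              domains: ["*"]
              routes:
              - match: {prefix: "/"}
                route:
                  cluster: webapp
                  retry_policy:                  # retry a failed request on another Pod
                    retry_on: "5xx,connect-failure"
                    num_retries: 2
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: webapp
    type: STRICT_DNS                             # re-resolve the headless Service name regularly
    dns_refresh_rate: 1s
    connect_timeout: 1s
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: webapp
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: webapp.default.svc.cluster.local   # headless Service; DNS returns all ready Pod IPs
                port_value: 80

The application then talks to 127.0.0.1:9001 instead of the Service name, and Envoy handles balancing and retries across the Pods.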

As an alternative to the sidecar deployment, you can run Envoy as a DaemonSet, one instance per node. The downside is that if that instance fails, every application on the node that relies on it stops working.

Since this proxy is not particularly resource-hungry, we decided to go with the sidecar container approach. Essentially, we reimplemented what a service mesh like istio does, but kept it simpler (in particular, we didn’t have to learn istio and deal with its particular bugs). Still, we may change our minds someday and move to istio or a similar solution.

We adopted Kubernetes at okmeter.io, and our service provides k8s monitoring; you can try it now!
