Posted in May 2021
An overview of things to take into account to scale successfully
So you have a nice working website; things are good, life is beautiful and the birds are singing. Then the marketing department decides to do some marketing stuff, and they say traffic will increase to about three times the normal level.
What does the infrastructure team do? A classic: multiply current server capacity by four, just in case.
Then the day comes, yes, THAT day. You think you are prepared, but you couldn’t sleep well last night; you are worried, your head full of “what ifs”. The company didn’t invest in a performance stress test, but you know that when the marketing team says three times more traffic it will be more like two times, and the infrastructure team multiplied the infra by four, so all should be well, right?
It could end well, or it could not…
At netlabs sometimes we get called well before THAT day, sometimes only a week before, sometimes DURING that day, sometimes after.
This two-part article covers some of our experiences and tips to increase your chances of sleeping better, continuing to have a beautiful life and hearing the birds sing, instead of facing executives and investors looking at your team with scary faces, strange tones, or worse!
The first step should be to build an exact replica of your production infrastructure, with the same amount of resources.
This is easier if you are currently using the Infrastructure as Code paradigm or some other way that allows you to quickly clone the current production infra and build a stress test one.
After some initial tests, use a tool like JMeter or a third-party service to build a simulation that resembles, as closely as possible, how your users actually behave on the website.
If, when you run it simulating the current real number of users, you see very similar usage across all your infrastructure components, then you are on the right path. Well done!
Getting it right can be very complex and time consuming, though. Use tools that let you analyze your real users’ behavior, not what you think your users do with your website.
You might be surprised at how differently your users use your website in comparison to how you think they do.
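To make that concrete, here is a minimal sketch (in Python, with hypothetical function names of our own) of turning observed request paths, say parsed from an access log, into a weighted load-test scenario, so the simulation follows what users actually do rather than what you assume:

```python
import random
from collections import Counter

def scenario_weights(paths):
    """Turn a list of observed request paths (e.g. from an access log)
    into per-path weights for a load-test scenario."""
    counts = Counter(paths)
    total = sum(counts.values())
    return {path: n / total for path, n in counts.items()}

def sample_requests(weights, k, seed=None):
    """Draw k simulated requests following the observed distribution."""
    rng = random.Random(seed)
    paths = list(weights)
    return rng.choices(paths, weights=[weights[p] for p in paths], k=k)

# What you THINK users do may differ a lot from what the log says:
log = ["/home"] * 10 + ["/search"] * 70 + ["/checkout"] * 20
weights = scenario_weights(log)  # /search dominates, not /home
```

The same idea applies whether you feed the weights into JMeter thread groups or into a third-party service’s scenario definition.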
If you need to simulate a lot of traffic, you need multiple JMeter servers, tuned accordingly (remember the file descriptors, garbage collector, kernel network settings, etc.); you have to make sure the JMeter farm isn’t itself the bottleneck.
That can be avoided if you hire the right third party service instead of running them yourself, of course.
Sometimes, though, the amount of traffic to simulate is so high that the cost of the simulation and of the infrastructure to support it is simply prohibitive; other times, you just don’t have enough time in advance to do it.
Anyway, there are some other things you can do, keep reading!
No, measuring CPU usage and a couple of other things is not enough!
Be sure to have the RIGHT metrics and be SMART about them.
Yes, overall CPU usage per server is important and a must-have, but do you also track maximum single-core CPU usage?
Some services don’t scale well across multiple cores, Redis for example (and yes, with enough traffic and network bandwidth you can saturate Redis on a single CPU core and prevent your nice website from scaling).
Other times a service is misconfigured and not all cores of the server are being put to use, so only some of the available cores get exhausted.
Those cases don’t show up when you look at overall CPU usage of a server, because that is the average across all cores; you NEED the maximum single-core usage too.
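As a rough sketch of the idea (assuming Linux and a hypothetical helper of our own), per-core busy time can be derived from two snapshots of /proc/stat, so you can report the maximum single-core usage alongside the average:

```python
def cpu_busy_percent(sample1, sample2):
    """Per-core busy% from two /proc/stat snapshots taken some
    seconds apart. Lines look like:
      cpu0 user nice system idle iowait irq softirq ...
    Busy time = everything except idle and iowait.
    """
    def per_core(text):
        cores = {}
        for line in text.splitlines():
            parts = line.split()
            # skip the aggregate "cpu" line, keep cpu0, cpu1, ...
            if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
                vals = [int(v) for v in parts[1:]]
                idle = vals[3] + (vals[4] if len(vals) > 4 else 0)
                cores[parts[0]] = (sum(vals), idle)
        return cores
    a, b = per_core(sample1), per_core(sample2)
    busy = {}
    for core in a:
        total = b[core][0] - a[core][0]
        idle = b[core][1] - a[core][1]
        busy[core] = 100.0 * (total - idle) / total if total else 0.0
    return busy

# On a live box: read /proc/stat twice, ~1s apart, then look at
# max(busy.values()) as well as the average, not just the average.
```

A monitoring agent would do the same thing continuously; the point is that max and average per core are different metrics.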
Conversely, if you have a server cluster, the max CPU usage of a single server matters in addition to the average CPU consumption of the group; otherwise some users will get responses that are too slow.
If you measure disk idle time and it gets near zero, you know the disk is bottlenecking you, and yes, knowing that is a good thing.
So you decide to buy faster disks, but faster at what? Faster at IOPS? Faster at throughput?
Measure disk idle time, IOPS and throughput (both read and write); that way you know whether you need a disk with more throughput or more IOPS, or one that is good at reads but not necessarily at writes, etc., and with that, buy smarter.
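Both numbers come from the same place. As a sketch (assuming Linux, with a hypothetical helper name), IOPS and throughput for a device can be computed from two /proc/diskstats lines, where counts are cumulative and sectors are 512 bytes:

```python
SECTOR = 512  # /proc/diskstats counts 512-byte sectors

def disk_rates(line1, line2, interval_s):
    """Read/write IOPS and throughput for one device, from two
    /proc/diskstats lines for it taken interval_s seconds apart.

    Field layout after (major, minor, name): reads completed,
    reads merged, sectors read, ms reading, writes completed,
    writes merged, sectors written, ...
    """
    f1, f2 = line1.split(), line2.split()
    read_iops  = (int(f2[3]) - int(f1[3])) / interval_s
    write_iops = (int(f2[7]) - int(f1[7])) / interval_s
    read_bps   = (int(f2[5]) - int(f1[5])) * SECTOR / interval_s
    write_bps  = (int(f2[9]) - int(f1[9])) * SECTOR / interval_s
    return {"read_iops": read_iops, "write_iops": write_iops,
            "read_bps": read_bps, "write_bps": write_bps}
```

Tools like iostat report the same rates; the value of having them in your metrics system is seeing, historically, whether IOPS or throughput saturates first.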
Measure not only bandwidth, but also packets per second and drops too, both at servers and networking equipment (switches).
Know their limits: are they bottlenecking you?
Having some latency (ICMP ECHO / ping) statistics to some destinations could also uncover periods of saturation.
Measure the number of connections, both incoming and outgoing, and TCP listen overflows (connections rejected by the kernel because the service’s backlog queue overflowed, meaning the service couldn’t accept more requests).
Too many active connections or listen overflows mean you may be losing connections even if your logs and other metrics look right; time to adjust some kernel settings, some service settings, or add more servers altogether.
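On Linux these counters live in /proc/net/netstat, under the TcpExt row (a header line of names followed by a line of values). A small sketch, with a hypothetical helper name, of extracting them:

```python
def tcpext_counters(netstat_text, *names):
    """Extract TcpExt counters (e.g. ListenOverflows, ListenDrops)
    from the text of /proc/net/netstat, which holds pairs of lines:
    a header of counter names and a matching line of values."""
    lines = netstat_text.splitlines()
    for header, values in zip(lines, lines[1:]):
        if header.startswith("TcpExt:") and values.startswith("TcpExt:"):
            keys = header.split()[1:]
            vals = [int(v) for v in values.split()[1:]]
            row = dict(zip(keys, vals))
            return {n: row.get(n, 0) for n in names}
    return {n: 0 for n in names}

# Sample periodically; a GROWING ListenOverflows count means the
# kernel is rejecting connections because the accept queue is full.
```

The same counters are visible in `netstat -s` output, but scraping /proc directly is friendlier for a metrics agent.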
How does RAM usage increase in relation to the number of connections?
Are you allowing a number of connections that could exhaust the RAM available?
If so, serialize connections: allow only as many simultaneous connections as your RAM can handle, and use the kernel’s TCP backlog to avoid rejecting connections that could still be served serially in an adequate amount of time.
Sometimes a queue of 128 connections is too small; tune your kernel and applications to accept more so you don’t fail during sudden, short traffic spikes. The kernel structures don’t take much memory, so you can be generous: 10k queued connections is no problem even on small servers.
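The kernel knobs involved are `net.core.somaxconn` (which caps the accept queue a service can ask for when it calls listen()) and `net.ipv4.tcp_max_syn_backlog` (the queue of half-open connections). A sketch, with hypothetical helper names, of checking whether the kernel would actually allow the backlog you want:

```python
def backlog_settings(read=lambda p: open(p).read()):
    """Current kernel queue limits, read from /proc/sys. The
    effective listen backlog is the MINIMUM of what the application
    requests and net.core.somaxconn."""
    return {
        "somaxconn": int(read("/proc/sys/net/core/somaxconn")),
        "tcp_max_syn_backlog": int(read("/proc/sys/net/ipv4/tcp_max_syn_backlog")),
    }

def backlog_ok(settings, wanted=10000):
    """True if a listen backlog of `wanted` would not be silently
    capped by the kernel."""
    return (settings["somaxconn"] >= wanted
            and settings["tcp_max_syn_backlog"] >= wanted)
```

Remember the application side too: raising the sysctls does nothing if the service itself still passes a backlog of 128 to listen().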
Sometimes RAM may seem sufficient, but we have seen several cases of a too-small Java heap causing too many garbage collector stalls: watch those old-generation major garbage collection stats.
Another case where RAM may seem sufficient but isn’t: too many disk reads that could be avoided. If your working set is not too big, having enough RAM to cache it and avoid those disk reads will speed up your service, sometimes enormously.
You can estimate the working set intuitively, or try one server with lots of RAM and see how much ends up in cache, then reduce RAM to an amount adequate to keep that cache size; or use blktrace with some scripts to measure your working set size exactly.
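For the lots-of-RAM experiment, the number to watch is how much memory the page cache ends up holding. A sketch (assuming Linux, hypothetical helper name) of reading it from /proc/meminfo, where values are reported in kB:

```python
def cache_usage(meminfo_text):
    """Page cache size vs. total RAM, parsed from /proc/meminfo
    (each line is 'Key:  value kB')."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key] = int(rest.split()[0])
    return {"total_kb": fields.get("MemTotal", 0),
            "cached_kb": fields.get("Cached", 0)}

# On a live box: cache_usage(open("/proc/meminfo").read())
# After the workload has warmed up, cached_kb approximates the
# working set the kernel found worth keeping in memory.
```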
Are your web servers and other services logging how long they took to answer, and do you have metrics for them? Too many 5xx responses?
In our experience at netlabs, many customers have those local metrics but completely forget about the external services they consume!
Everything in your infrastructure and services may look just right, yet you still see long response times; where is the time being spent, then?
Yeah… that third party you call in too many places on your website… yep, that one... MEASURE their response times and status codes, set your timeouts for their responses correctly, and don’t let them saturate your maximum number of active connections.
See if their response times increase as you increase traffic; that may indicate that on marketing day they will saturate and start to time out, so let them know in advance!
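A minimal sketch of the wrapping idea, with a hypothetical helper name: every call to a third party gets timed, and failures (including timeouts) become data points instead of hung workers. Here `fn` stands for whatever performs the external request, with its own timeout configured:

```python
import time

def timed_call(fn):
    """Wrap a third-party call: measure elapsed time, capture the
    status, and turn any exception (timeout, connection refused, ...)
    into a recorded result instead of an unbounded wait.

    `fn` is expected to perform the external request with a timeout
    already set on it, and to return a status code on success."""
    start = time.monotonic()
    try:
        status = fn()
        error = None
    except Exception as exc:
        status, error = None, type(exc).__name__
    return {"status": status, "error": error,
            "elapsed_s": time.monotonic() - start}
```

Feed the elapsed times and statuses into the same metrics system as your own services, so the third party shows up on the same dashboards you will be staring at on marketing day.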
Even if you don’t use microservices or serverless architectures, monolithic applications can be profiled, and even when using microservices, profiling them could help pinpoint exactly where time is being spent.
If you have an APM, that’s great; but even if you don’t, there are many profiling tools that work even in production environments without slowing them down too much.
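As one concrete example among many (this is Python’s standard-library profiler, not a recommendation over sampling profilers, which are usually safer in production), a single suspect code path can be profiled like this; the helper names are our own:

```python
import cProfile
import io
import pstats

def profile(fn, top=5):
    """Profile one call with the stdlib profiler and return the
    hottest functions by cumulative time, as text."""
    pr = cProfile.Profile()
    pr.enable()
    fn()
    pr.disable()
    out = io.StringIO()
    pstats.Stats(pr, stream=out).sort_stats("cumulative").print_stats(top)
    return out.getvalue()

def handler():
    # stand-in for a request handler you suspect is slow
    return sum(i * i for i in range(100_000))
```

The report pinpoints where the time goes inside the call, which is exactly the question left open when all the infrastructure metrics look fine.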
OK, so in this first part we shared some tips about what to measure and how. In the second and final part, we will go over common bottlenecks and roadblocks, with tips and tricks to solve them.
Hope that you enjoyed it and found some of the tips useful!