Serverless Functions Post-Mortem

sizeoftheuniverse@programming.dev · 1 year ago

Serverless Functions Post-Mortem

christophski@feddit.uk · 1 year ago

It’s mind boggling that having an easy to use local environment wasn’t the first thing cloud providers did

UFO@programming.dev · 1 year ago

Not mine boggling imo when you think of it from the angle of “then they’ll have to spend more money!”

Otoh I had an argument with an AWS rep who just didn’t understand why I wanted an isolated local dev environment.

TehPers@beehaw.org · edit-2 1 year ago

As someone who’s worked a lot with Azure Functions, the experience for me in Visual Studio has always been:

Create C# function app
Write the code
Hit F5

The Functions runtime can be ran locally as a standalone as well, and I was able to get Rust function apps working locally using a custom handler. There’s also a vscode plugin to run them.

Things might be different for Lambdas/GCP’s Functions?

christophski@feddit.uk · 1 year ago

I’m thinking more about being able to run your entire environment locally. We use GCP and we have a combination of appengine, cloud run and cloud functions tied together with api requests and pubsub. The cloud functions are the main bit missing from our local environment as we’ve not been able to spend the time to set it up yet.

snowe@programming.dev · 1 year ago

Hm. this is a very strange article for me to read because in my experience, only 1 or 2 things in the whole article have been true for our company (17k employee company with 300 people in the tech org).

Users would ingress through the API Gateway technology, which handles everything from traffic management, CORS, authorization and API version management. It basically serves as the web server and framework all in one. Easy to test new versions with multiple versions of the same API at the same time, easy to monitor and easy to set up.

We don’t use API Gateway. The best use for lambdas is as a direct call, using the ARN. You don’t need to worry about CORS, permissions, etc. You either have access to call the lambda or you don’t. You can directly control exactly who can call your service, and you never need to set up IAM at all.

Local development. Typically a developer pulls down the entire application they’re working on and runs it on their device to be able to test quickly. With serverless, that doesn’t really work since the application is potentially thousands of different services written in different languages. You can do this with serverless functions but it’s way more complicated.

I think if you’re literally recreating your monolith in a lambda then you’re doing something fundamentally wrong. Our entire team only has a few lambdas (10-15) and they’re very easy to manage. But yes, testing locally is an issue. We’ve solved this with testcontainers, which would be the same solution if you were just deploying docker services to k8s or openshift or even directly to a VM. This is the first very large issue with lambdas that the article is correct about though.

Hard to set resources correctly. How much memory did this function need under testing can be very different from how much it needs under production. Developers tended to set their limits high to avoid problems, wiping out much of the cost savings. There is no easy way to adjust functions based on real-world data outside of doing it by hand one by one.

I do not understand this. How are your resources changing like that? We’ve only had to touch the resources for our very large functions, and even then we’ve touched them only once or twice in 3 years. This is absolutely a non-issue. Set it to the lowest to start, then when it times out update it to the next level. It really isn’t difficult.

Since even a medium sized application can be made up of 100+ functions, this is a non-trivial thing to do.

please. why in the world would you think a ‘medium sized application’ would have 100+ functions? That’s absolutely insane. there’s no way to manage that. That’s not using serverless properly. We have a medium sized ‘application’ (250k+ lines of kotlin, with tens of millions of lines of generated Java from Drools Rules) and it’s 10-15 lambdas. You should not have hundreds of lambdas for a medium sized app. That’s just idiotic, I’m sorry, but that was never what serverless was meant for.

Is it working? Observability is harder with a distributed system vs a monolith and serverless just added to that. Metrics are less useful as are old systems like uptime checks. You need, certainly in the beginning, to rely on logs and traces a lot more. For smaller teams especially, the monitoring shift from “uptime checks + grafana” to a more complex log-based profile of health was a rough adjustment.

I can agree with some of this partially, but I’m not sure why the author thinks that getting rid of uptime checks is a problem. I’ve never once had to worry about whether our lambdas are up. There’s no uptime! It either works every time or it doesn’t work at all. It’s pretty awesome actually. Of course you do need to test it when you deploy, but that’s a simple http call and boom you know whether the deploy worked or not.

Traces are also a great way of tracking where you have slowness in your system. I’m guessing a lot of this depends on which ‘ecosystem’ you choose, but with Quarkus and XRay, tracing is dead simple. Add a dependency, you’ve got tracing. Done.

Now, the big problem here is the error passing, which the author talks about later.

Latency. Traditional web frameworks and containers are fast at processing requests, typically hitting latency in database calls. Serverless functions were slow depending on the last time you invoked them. This led to teams needing to keep “functions warm.”

well sure, but if you have 100+ functions then you’re multiplying your instantiation costs by 100+. This is not a non-issue for fewer lambdas, but it’s much less of a problem.

Later Provisioned Concurrency was added, which is effectively…a server. It’s a VM where your code is already loaded. You are limited per account to how many functions you can have set to be Provisioned Concurrency, so it’s hardly a silver bullet. Again none of this happens automatically, so its up to someone to go through and carefully tune each function to ensure it is in the right category.

correct. but it’s a server you don’t have to manage. I don’t know why the author calls this out this way, you have to manage autoscaling servers at a much finer grained level. Provisioned Concurrency is literally “how many functions do you want to be running at any point in time by default”. There’s not much else to it, besides the next point.

But it is very possible for one function to eat all of the capacity for every other function. Again it requires someone to go through and understand what Reserved Concurrency each function needs and divide that up as a component of the whole.

this is a major issue. no other thing to say about it. I do not understand why this is the case, but yes it’s a huge problem, and makes scaling across an org very difficult. The one solution to this is to split your org into separate aws accounts (not sure how gcp manages it, but we do use gcp too), which helps with that, but it’s still a weird restriction.

In addition, serverless functions don’t magically get rid of database concurrency limits. So you’ll hit situations where a spike of traffic somewhere else kills your ability to access the database. This is also true of monoliths, but it is typically easier to see when this is happening when the logs and metrics are all flowing from the same spot.

Hm. maybe this depends on using RDS, because we’ve never seen this with Dynamo.

In practice it is far harder to scale serverless functions than an autoscaling group. With autoscaling groups I can just add more servers and be done with it. With serverless functions I need an in-depth understanding of each route of my app and where those resources are being spent. Traditional VMs give you a lot of flexibility in dealing with spikes, but serverless functions don’t.

This has not been our experience. Lambdas have been simple set it and forget it, allowing our team (and company) to focus on the business rather than infra. We only spend time configuring lambdas when we are creating new ones, which isn’t too often. It’s been 3 years since we started using lambdas, and I would say we create maybe 3-5 a year and they’re all for new features, not new individual functions. The companies business has grown more than 3x in that time and we have daily spikes and it all just works.

Teams switched from always having a detailed response from the API to just returning a 200 showing that the request had been received. That allowed teams to stick stuff into an SQS queue and process it later. This works unless there is a problem in processing, breaking the expectations from most clients that 200 means the request was successful, not that the request had been received.

this is by far the most annoying thing about lambdas. they are http under the covers, but you can’t modify any http headers, response codes, etc. It’s either ‘throw an exception’ or ‘200’. nothing in between. very annoying.

snowe@programming.dev · 1 year ago

continued

Functions often needed to be rewritten as you went, moving everything you could to the initialization phase and keeping all the connection logic out of the handler code. The initial momentem of serverless was crashing into the rewrites as teams learned painful lesson after painful lesson.

we have never encountered this. this is probably exacerbated by the fact that the author thinks having 100+ lambdas for a medium sized app is normal. you focus even more on the startup time, rather than solving business problems.

Price. Instead of being fire and forget, serverless functions proved to be very expensive at scale. Developers don’t think of routes of an API in terms of how many seconds they need to run and how much memory they use. It was a change in thinking and certainly compared to a flat per-month EC2 pricing, the spikes in traffic and usage was an unpleasant surprise for a lot of teams.

Combined with the cost of RDS and API Gateway and you are looking at a lot of cash going out every month.

lambdas are saving us tens of thousands of dollars a month because we don’t need to worry about massive monoliths and the required ec2 autoscaling instances needed, nor the insane costs of RDS.

The other cost was the requirement that you have a full suite of cloud services identical to production for testing. How do you test your application end to end with serverless functions? You need to stand up the exact same thing as production.

why do you need this? That’s not how most testing works. you mock what you need. Unless you’re using a monolith then this applies to any architecture.

Traditional applications you could test on your laptop and run tests against it in the CI/CD pipeline before deployment.

if by traditional you mean monoliths. Any sort of microservices, or even a slightly macroservices architecture.

Serverless stacks you need to rely a lot more on Blue/Green deployments and monitoring failure rates.

but why? this isn’t explained. We haven’t seen this. Maybe it’s how we use lambdas, but we use versioned lambdas and we deploy and immediately forget about it. There’s nothing to maintain about old versions, rollbacks are automatic.

Slow deployments. Pushing out a ton of new Lambdas is a time-consuming process. I’ve waited 30+ minutes for a medium-sized application. God knows how long people running massive stacks were waiting.

why are you ‘pushing out a ton of new lambdas’? The whole point is for things to be self contained. If you are needing to touch multiple things often then your lambdas should be a single thing, not multiple. This comes back to the ‘100+’ lambdas thing. That’s just bad design. Don’t blame lambdas for this.

We are able to build GraalVM Kotlin lambdas in less time than that, along with the deploy. The slowest part is literally the CDK synthesis. If we were using CF yaml then it would be half the time.

Security. Not running the server is great, but you still need to run all the dependencies. It’s possible for teams to spawn tons of functions with different versions of the same dependencies, or even choosing to use different libraries. This makes auditing your dependency security very hard, even with automation checking your repos. It is more difficult to guarantee that every compromised version of X dependency is removed from production than it would be for a smaller number of traditional servers.

This is going to completely depend on your team, your languages, and the frameworks you’re using. For us, it’s dead simple to keep up to date. Snyk helps us, it’s one click deploy for each lambda, and we can send to prod immediately due to having a very mature CI/CD pipeline. We are getting even better at this as we’ll be switching to gradle’s version catalogs which means that all of the applications can use the exact same version catalog and it will then require a single change whenever we need to update stuff, instead of hundreds.

The complexity of running a server in a modern cloud platform was massively overstated. Especially with containers, running a linux box of some variety and pushing containers to it isn’t that hard. All the cloud platform offer load balancers, letting you offload SSL termination, so really any Linux box with Podman or Docker can run listening on that port until the box has some sort of error.

so now you have to maintain your linux security, your autoscaling on linux, your deployment pipelines for linux, your nginx configs for linux, or if you’re using k8s you have to learn two stacks, k8s and AWS. If you’re adding loadbalancers then you should be deploying those with CDK anyway so now you’re both using cdk and k8s along with maintaining security on your ec2 instances.

Setting up Jenkins to be able to monitor Docker Hub for an image change and trigger a deployment is not that hard. If the servers are just doing that, setting up a new box doesn’t require the deep infrastructure skills that serverless function advocates were talking about. The “skill gap” just didn’t exist in the way that people were talking about.

O___o we literally have an entire infra team that is unable to manage Jenkins to the level that devs need due to how difficult it is to maintain a jenkins build pipeline. Not only that, but now you’re dependent on maintaining security for jenkins which is, and has always been, a nightmare. Jenkins pipelines aren’t testable locally (github actions you can use Act along with something like mock-github and act.js to even test your pipelines as part of ci/cd!). We’re currently switching the entire org to github actions due to how terrible jenkins is. And then you’re writing more pipelines to do monitoring! The author claims that serverless has more monitoring, but then goes on to say that you can ‘simply set up’ all this other stuff which is wayyyy harder to maintain in the long run.

People didn’t think critically about price. Serverless functions look cheap, but we never think about how many seconds or minute a server is busy. That isn’t how we’ve been conditioned to think about applications and it showed. Often the first bill was a shocker, meaning the savings from maintenance had to be massive and they just weren’t.

100+ lambdas once again.

Really hard to debug problems. Relying on logs and X-Ray to figure out what went wrong is just much harder than pulling the entire stack down to your laptop and triggering the same requests.

I don’t know what the author is doing, but it’s so dead simple to run lambdas locally and test locally that I really really don’t understand this. There’s only a single entrypoint. You know what the request was going in and out. It takes me way less time to debug something in a lambda than it ever did in a monolith (I’ve worked in a lot of monoliths and we still maintain a monolith on my team). If you have 100+ lambdas then maybe you should start blaming your architecture, rather than lambdas. It would be the exact same if you had 100+ microservices…a nightmare.

I am very sorry, but I honestly read through the whole article and agreed with a lot of it, and then when I went to write this up just got angrier and angrier because it’s very very clear that the author has a terrible architecture and is blaming it on lambdas. Lambdas don’t work for everything. In general don’t use them for web servers! But for a great solution to small self contained applications, or an architecture that might need one side to scale differently than others, or for step functions where you’re a state-flow diagram, the list goes on and on… then it’s a fantastic solution.

nibblebit@programming.dev · 1 year ago

Man, I have to agree. Your write up reflect my experience with Azure Functions in a mid-large sized application way more than the post. Fantastic

TehPers@beehaw.org · 1 year ago

I’d even go further with Azure Functions and say that running them locally is really simple. Of all the issues I’ve had with them, running them locally was never an issue.

Sigmatics@lemmy.ca · 1 year ago

Classic example of the Hype Cycle.

People need to figure out the perfect use cases for the tech first, and in this case it was being applied to way too many.

interolivary@beehaw.org · edit-2 1 year ago

I’ve done some work with “actually” distributed systems (as in gossip protocols and self-organizing networks and logical clocks and blah), so I was fairly skeptical of the promises of serverless functions right from the start, because many of these pitfalls are – like the article notes – pretty classic stuff in distributed systems.

Honestly it strikes me as hilarious that someone would think “hey, you know what would make this system easier to operate? Making it distributed!”. We’ve seen the same thing with microservices; endless boondoggles with network interaction graphs that look like they could summon Cthulhu, brittle codebases where you can break services you didn’t even know existed by changing something in your API you thought was trivial, junior (and sometimes even more senior) coders writing stuff that doesn’t take into account eventual consistency (or database write consistency levels, or what happens if the service dies in the middle of operation X, or or or or or…) and that breaks seemingly at random, etc. etc.

Not that I’m saying it’s impossible to get these things right. Definitely not, and I’ve seen that happen too. It just takes much more work and skill, a well planned infrastructure, good tracing, debugging and instrumentation capabilities, painless local dev envs etc. etc., and on top of all the “plumbing” you need several people with a solid understanding of distributed systems and their behavior, so they can keep people from making costly mistakes without even realizing it until production.

MagicShel@programming.dev · 1 year ago

Oddly enough I’d say this exact same thing about monoliths. Except for one thing: you’re right that applications are easier to implement and operate as monoliths, but they are easier to manage and maintain as microservices. So in a way, the question is one of perspective. If you want shit done today, write a monolith. It’s you want that shit to continually operate and grow for a decade or more, write microservices.

interolivary@beehaw.org · 1 year ago

Oh I have nothing against microservices as a concept and they can be very maintainable and easy to operate, but that’s rarely the outcome when the people building the systems don’t quite know what they’re doing or have bad infra.

Monoliths are perfectly fine for many use cases, but once you go over some thresholds they get hard to scale unless they’re very well designed. Lock free systems scale fantastically because they don’t have to wait around for eg. state writes, but those designs often mean ditching regular databases for the critical paths and having to use deeper magics like commutative data structures and gossiping, so your nodes can be as independent as possible and not have any bottlenecks. But even a mediocre monolith csn get you pretty far if you’re not doing anything very fancy.

Microservices give you more granular control “out of the box”, but that doesn’t come without a price, so you need much better tooling and a more experienced team. Still a bit of a minefield because reasoning about distributed systems is hard. They have huge benefits but you really need to be on your game or you’ll end in a world of pain 😀 I was the person who unfucked distributed systems at one company I worked in, and I was continuously surprised by how little thought many coders paid to eg. making sure the service state stays in a “legal” state. Database atomicity guarantees were often either misused or not used at all, so if a service had to do multiple writes to complete some “transaction” (loosely) and it died during rhe write, or a database node died and only part of the writes went through to the master, and maybe suddenly you’re looking at some sort of spreading Byzantine horror where nothing makes sense anymore because that partially completed group of writes has affected other systems. Extreme example, sure, but Byzantine faults where a corrupted state spreads and fucks your consensus are something you only see in a distributed context.

MagicShel@programming.dev · edit-2 1 year ago

Yeah so that’s one place I sort of sacrifice microservice purity. If you have a transaction that needs to update multiple domains then you need a non-microservice to handle that IMO. All of these rules are really good rules of thumb, but there will always be complexity that doesn’t fit into our perfect little boxes.

The important thing is to document the hell out of the exceptions and do everything you can to keep them on the periphery of business logic. Fight like hell to minimize dependencies on them.

But that’s true of any architecture.

If it’s not clear I pretty much agree with everything you say here.

1stTime4MeInMCU · 1 year ago

Never seen so much truth in one article. 90% of applications would be fine as small VMs running monoliths. Dev time is an expensive resource compared to VMs and the simplicity promised just isn’t there. And having tech companies that run the major cloud platforms also be the software evangelists that herald “the new best way” of doing development was always a conflict of interest.

That being said, FaaS is nonetheless a useful tool in the toolbelt for the odd app that does actually need crazy scale to 1000000 scale back to 0, or for certain kinds of simple apps. Traditional app development still rules the middle space when it comes to team productivity.

CoderKat@lemm.ee · 1 year ago

I totally agree that most servers work best as monoliths. Though at the same time, every now and then there’s a case that really needed a microservice and you’ll regret not having started that way, cause migrating a monolith that was never designed to be anything but a monolith can be really hard.

I have one of those. A server that is so large, complicated, and contributed to by so many different teams that it takes a lot of extra work to safely release and debug issues. Honestly, the monolithic structure does still make it easier to understand as a whole. It’s not like splitting the server up would make understanding the end-to-end experience any easier (it would definitely become more complicated). But releasing such big servers with so many changes is harder, especially since users don’t care about your architecture. They want it to work and they want 100% uptime. A bigger server means more to verify correctness before you can release it and when something is incorrect, you might be blocked on some other team fixing it.

MagicShel@programming.dev · 1 year ago

Maybe it’s because my career has been based around microservices for the past few years, but I don’t think the need for microservices is as narrow as many folks think. At least within a large company it’s as much about segregating lines of concern and responsibility as it is about speed and efficiency. It’s a lot easier and cheaper to spin up new hardware than it is to manage and coordinate all the varied interests in a monolith.

You point out the problems of a monolith that has grown beyond the ability to effectively manage it, but every application only grows (until it is replaced). I think we are in agreement other than you minimize the usefulness more than I would.

My experience is every monolith either grows until it is so full of tech debt that it can’t be maintained effectively any more, or it gets cloned over and over with minor variations and you wind up with huge messes of rewriting the same code over and over necessitating a massive development undertaking when a new business need comes along - and then the whole shebang gets shit-canned to be replaced by a new product.

Properly architected microservices segregate concerns and make huge efforts easier to do in small units and often simultaneously. It doesn’t have to be this way. It’s fair to say that only is a problem with poorly architected monoliths, but my experience is bad architecture always creeps in or is never fixed because it works well enough for now. The forced segregation is inefficient, and frustrating as hell for juniors, but at the project management level it’s a huge boon.

Just my perspective after twenty five years. But as I say, I’m heavily invested in microservices and don’t claim to be unbiased. Monoliths have their place, but I think businesses that are serious about their software need microservices.

1stTime4MeInMCU · 1 year ago

My hot take is that unmaintainable monoliths result from poor system design / too strong coupling. If you can’t cleave off portions of your monolith without breaking it you built it wrong in the first place, and the choice between monolith and microservice isn’t going to save you. Perhaps starting with a microservice forces people to make (or at least consider) better design choices from the beginning but 1. there is no reason you can’t make those same architectural decisions with a monolith and 2. you can still strongly couple microservices with poor design.

Getting back to basics of “what makes for good application development” using good abstractions, Go4 patterns, SOLID / KISS / DRY / etc, means that whether your threads are running colocated vs on another VM vs on another box vs in another datacenter vs in another continent shouldn’t affect you much. If your app breaks in ways beyond “I wonder if moving this job to another system means we’ll suffer from memory nonlocality or sync latency” the walls are already closing in lol.

MagicShel@programming.dev · 1 year ago

Yeah I agree with all of that. Architecture is hard and you don’t normally fully understand what you need until you’ve built it the wrong way.

Neo@lemmy.hacktheplanet.be · 1 year ago

That was an insightful, well written article. Good read, thanks for sharing.

Neo@lemmy.hacktheplanet.be · 1 year ago

Connect Lemmy client for Android seems buggy and kept posting my comment multiple times. Posting this reply from Jerboa, hopefully less buggy.

GrumpyOldMan@programming.dev · 1 year ago

I feel these pains daily. I also have a few senior engineers who are still drinking the Serverless Kool-aid (ex-AWS people).

AWS CDK could solve some of these problems but the platform isn’t really there yet. We have a few apps where we can run testing in CI or do local deploy/debug via Localstack. But setting that up was a massive pain, much more so than just running an old-school app in a debugger.

argv_minus_one@beehaw.org · edit-2 1 year ago

Why were so many smart people wrong?

They weren’t. The smart people are people like you, not these hype train conductors.

Neo@lemmy.hacktheplanet.be · 1 year ago

deleted by creator

Neo@lemmy.hacktheplanet.be · 1 year ago

deleted by creator

exu@feditown.com · 1 year ago

Site doesn’t load for me unfortunately :/

cestvrai@lemm.ee · 1 year ago

https://archive.is/90dur

Neo@lemmy.hacktheplanet.be · 1 year ago

deleted by creator