Running autonomous robots on city streets is very much a software engineering challenge. Some of this software runs on the robot itself, but a lot of it actually runs in the backend: things like remote control, path finding, matching robots to customers, and fleet health management, but also interactions with customers and merchants. All of this needs to run 24×7, without interruptions, and scale dynamically to match the workload.
SRE at Starship is responsible for providing the cloud infrastructure and platform services for running these backend services. We have standardized on Kubernetes for our microservices and run it on top of AWS. MongoDB is the primary database for most backend services, but we also like PostgreSQL, especially where strong typing and transactional guarantees are required. For asynchronous messaging, Kafka is the platform of choice, and we use it for almost everything except shipping video streams from robots. For observability we rely on Prometheus and Grafana, Loki, Linkerd and Jaeger. CI/CD is handled by Jenkins.
A good portion of SRE time is spent on maintaining and improving the Kubernetes infrastructure. Kubernetes is our main deployment platform and there is always something to improve, be it fine-tuning autoscaling settings, adding Pod disruption policies or optimizing the use of Spot instances. Sometimes it is like laying bricks – simply installing a Helm chart to provide a particular piece of functionality. But often the “bricks” must be carefully picked and evaluated (is Loki good for log management, is a service mesh a thing and, if so, which one), and occasionally the functionality doesn’t exist in the world and has to be written from scratch. When this happens we usually turn to Python and Golang, but also Rust and C when needed.
Another major piece of infrastructure that SRE is responsible for is data and databases. Starship started with a single monolithic MongoDB – a strategy that has worked well so far. However, as the business grows we need to revisit this architecture and start thinking about supporting thousands of robots. Apache Kafka is part of the scaling story, but we also need to figure out sharding, regional clustering and microservice database architecture. On top of that we are constantly developing tools and automation to manage the existing database infrastructure. Examples: add MongoDB monitoring with a custom sidecar proxy to analyze database traffic, enable PITR support for databases, automate regular failover and recovery tests, collect metrics for Kafka re-sharding, enable data retention.
Finally, one of the most important goals of Site Reliability Engineering is to minimize downtime for Starship’s production. Although SRE is occasionally called upon to deal with infrastructure outages, the more impactful work is done on preventing outages and ensuring that we can recover quickly. This can be a very broad topic, ranging from having rock-solid K8s infrastructure all the way to engineering practices and business processes. There are great opportunities to make an impact!
A day in the life of an SRE
Arriving at work some time between 9 and 10 (sometimes working remotely). Grab a cup of coffee, check Slack messages and emails. Review the alerts that fired during the night and see if there is anything interesting there.
Find that MongoDB connection latencies have spiked overnight. Digging into the Prometheus metrics with Grafana, find that this happens while the backups are running. Why is this suddenly a problem, when we’ve been running those backups for ages? It turns out that we’re compressing the backups very aggressively to save on network and storage costs, and this is consuming all available CPU. It seems that the load on the database has grown just enough to make this noticeable. This is happening on a standby node and does not affect production, but it is still a problem should the primary fail. Add a Jira item to fix it.
In passing, change the MongoDB prober code (Golang) to add more histogram buckets for a better understanding of the latency distribution. Run the Jenkins pipeline to roll out the new probe.
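To illustrate why extra buckets help, here is a minimal Python sketch (the prober itself is written in Go, and its code is not shown here). The bucket bounds and the latency samples are made-up values, and `exponential_buckets` and `bucket_counts` are hypothetical helpers written for this example. Counts are shown per bucket for readability, whereas Prometheus itself stores cumulative bucket counts.

```python
import bisect


def exponential_buckets(start, factor, count):
    """Return `count` upper bounds: start, start*factor, start*factor**2, ..."""
    if start <= 0 or factor <= 1 or count < 1:
        raise ValueError("need start > 0, factor > 1 and count >= 1")
    return [start * factor ** i for i in range(count)]


def bucket_counts(bounds, observations):
    """Count observations per bucket; the extra last slot is the overflow bucket."""
    counts = [0] * (len(bounds) + 1)
    for value in observations:
        # bisect_left finds the first bound >= value ("less than or equal" buckets).
        counts[bisect.bisect_left(bounds, value)] += 1
    return counts


# Made-up connection latencies in milliseconds.
latencies_ms = [0.8, 1.2, 3.0, 4.5, 7.0, 9.0, 40.0, 120.0]

coarse = [1, 10, 100]                # few buckets: most samples share one bucket
fine = exponential_buckets(1, 2, 8)  # 1, 2, 4, ..., 128

print("coarse:", bucket_counts(coarse, latencies_ms))
print("fine:  ", bucket_counts(fine, latencies_ms))
```

With the coarse bounds, five of the eight samples fall into the single 1–10 ms bucket, so nothing can be said about where in that range they sit. The finer bounds spread the same samples over several buckets, which is what makes percentile estimates more meaningful.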
At 10 am there’s a standup meeting: share your updates with the team and find out what others have been up to – setting up monitoring for a VPN server, instrumenting a Python app with Prometheus, setting up ServiceMonitors for external services, debugging MongoDB connectivity issues, piloting canary deployments with Flagger.
After the meeting, resume the work planned for the day. One of the things I planned for today is setting up an additional Kafka cluster in a test environment. We are running Kafka on Kubernetes, so it should be straightforward to take the existing cluster YAML files and tweak them for the new cluster. Or, on second thought, should we use Helm instead, or maybe there is a good Kafka operator available now? No, not going there – too much magic, I want more explicit control over my StatefulSets. Raw YAML it is. An hour and a half later a new cluster is running. The setup was fairly straightforward; only the init containers that register Kafka brokers in DNS needed a config change. Generating credentials for the applications required a small bash script to set up the accounts in ZooKeeper. One bit left dangling was setting up Kafka Connect to capture database change log events – it turns out that the test databases are not running in replica set mode and Debezium cannot read the oplog from them. Backlog this and move on.
Now it is time to prepare a scenario for the Wheel of Misfortune exercise. At Starship we run these to improve our understanding of systems and to share troubleshooting techniques. It works by breaking some part of the system (usually in test) and having some unfortunate person try to troubleshoot and mitigate the problem. In this case I will set up a load test with hey to overload the microservice for route calculations. Deploy it as a Kubernetes Job called “haymaker” and hide it well enough that it does not immediately show up in the Linkerd service mesh (yes, evil). Later, run the “Wheel” exercise and take note of any gaps we have in playbooks, metrics, alerts, etc.
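A load-generating Job of this kind might look roughly like the manifest built below. This is only a sketch under stated assumptions: the container image, the target URL, the duration and the concurrency are all hypothetical, the image is assumed to use `hey` as its entrypoint, and disabling Linkerd proxy injection is just one guess at how such a Job could be kept out of the mesh view. The manifest is emitted as JSON, which `kubectl apply -f` accepts as well as YAML.

```python
import json


def hey_job_manifest(name, target_url, duration="5m", concurrency=50,
                     image="example.invalid/hey:latest"):
    """Build a Kubernetes Job manifest that runs the `hey` load generator once."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry the load test if it fails
            "template": {
                "metadata": {
                    # Assumption: skip Linkerd sidecar injection for this pod.
                    "annotations": {"linkerd.io/inject": "disabled"},
                },
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "hey",
                        "image": image,  # hypothetical image with hey as entrypoint
                        # hey flags: -z is the test duration, -c the number of workers.
                        "args": ["-z", duration, "-c", str(concurrency), target_url],
                    }],
                },
            },
        },
    }


# Hypothetical in-cluster URL of the route calculation service.
manifest = hey_job_manifest(
    "haymaker", "http://route-service.test.svc.cluster.local/route")
print(json.dumps(manifest, indent=2))
```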
In the last few hours of the day, block all interruptions and try to get some coding done. I have reimplemented the Mongoproxy BSON parser as a streaming asynchronous one (Rust + Tokio) and want to find out how well it works with real data. It turns out there is a bug somewhere in the parser guts and I need to add deep logging to figure it out. Find a wonderful tracing library for Tokio and move on …
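The core idea of a streaming BSON parser can be sketched in a few lines. The real Mongoproxy parser is written in Rust with Tokio and is not reproduced here; the Python class below is a hypothetical illustration of the framing step only. It relies on two facts from the BSON format: every document starts with a little-endian int32 holding its total length, and ends with a NUL byte. The 16 MiB size cap is used here as a sanity limit and is an assumption of this sketch.

```python
import struct


class BsonFramer:
    """Split an incoming byte stream into complete BSON documents."""

    MAX_DOC = 16 * 1024 * 1024  # sanity limit on document size (assumption)

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk):
        """Add a chunk of bytes and return the documents completed so far."""
        self._buf.extend(chunk)
        docs = []
        while len(self._buf) >= 4:
            # The first four bytes hold the total document length.
            (length,) = struct.unpack_from("<i", self._buf, 0)
            if length < 5 or length > self.MAX_DOC:
                raise ValueError(f"invalid BSON document length: {length}")
            if len(self._buf) < length:
                break  # document is incomplete, wait for more data
            doc = bytes(self._buf[:length])
            if doc[-1] != 0:
                raise ValueError("BSON document is missing its trailing NUL byte")
            docs.append(doc)
            del self._buf[:length]
        return docs


EMPTY_DOC = b"\x05\x00\x00\x00\x00"                      # {}
INT_DOC = b"\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"  # {"a": 1} as int32

framer = BsonFramer()
stream = EMPTY_DOC + INT_DOC
first = framer.feed(stream[:3])    # not even a full length prefix yet
second = framer.feed(stream[3:9])  # completes the empty document
third = framer.feed(stream[9:])    # completes the second document
print(first, second, third)
```

Because the framer keeps partial data in its buffer between calls, it does not matter where the network happens to split the stream, which is the property an asynchronous parser needs.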
Disclaimer: The events described here are based on a true story. Not all of it happened on the same day. Some meetings and interactions with colleagues have been edited out. We are hiring.