A gigabyte of data for a bag of groceries. That is what you get when doing robotic delivery. That’s a lot of data – especially if you repeat it more than a million times, as we have.
But the rabbit hole goes deeper. The data is also incredibly diverse: robot sensor and image data, user interactions with our apps, transactional data from orders, and more. The use cases are equally diverse, ranging from training deep neural networks to creating polished visualizations for our merchant partners, and everything in between.
So far, we have been able to handle all this complexity with our centralized data team. By now, our ever-increasing growth has pushed us to look for new ways of working to keep up the pace.
We have found the data mesh paradigm to be the best way forward. I’ll describe Starship’s take on the data mesh below, but first, let’s go through a brief summary of the approach and why we decided to go with it.
What is a data mesh?
The data mesh framework was first described by Zhamak Dehghani. The paradigm relies on the following key concepts: data products, data domains, data platform, and data governance.
The main purpose of the data mesh framework has been to help large organizations eliminate data engineering bottlenecks and deal with complexity. It therefore addresses many details that are relevant in an enterprise setting, from data quality, architecture and security to governance and organizational structure. As it stands, only a handful of companies have publicly announced that they follow the data mesh paradigm, all of them large multi-billion-dollar enterprises. Nevertheless, we think it can be applied successfully in smaller companies as well.
Data mesh at Starship
Do the data work close to the people who produce or consume the information
To run hyperlocal robotic delivery marketplaces around the world, we need to turn a wide variety of data into valuable products. The data comes in from robots (such as telemetry, routing decisions, ETAs), merchants and customers (their apps, orders, offerings, etc.) and all operational aspects of the business (from short remote operator tasks to the global supply of parts and robots).
The diversity of use cases is the main reason we were attracted to the data mesh approach – we want to do the data work very close to the people who produce or consume the information. By following data mesh principles, we hope to meet the diverse data needs of our teams while keeping central oversight reasonably light.
Since Starship is not yet at enterprise scale, it is not practical for us to implement all aspects of a data mesh. Instead, we have settled on a simplified approach that makes sense for us now and puts us on the right path for the future.
Define what your data products are – each with an owner, interface and users
Applying product thinking to our data is the foundation of the whole approach. We think of anything that exposes data to other users or processes as a data product. It can expose its data in any form: a BI dashboard, a Kafka topic, a data warehouse view, a response from a predictive microservice, and so on.
A simple example of a data product at Starship could be a BI dashboard that site leads use to track their site’s business volume. A more elaborate example would be a self-serve pipeline that robot software engineers use to send any driving-related information from robots to our data lake.
However, we do not treat our data warehouse (actually a Databricks lakehouse) as a single product, but as a platform that supports a number of interconnected products. These granular products are usually owned by the data scientists and engineers who build and maintain them, not by dedicated product managers.
The product owner is expected to know who their users are and what needs the product addresses for them, and based on that, to define and live up to the quality expectations for the product. Perhaps as a result, we have started paying more upfront attention to interfaces, the components that are critical to usability but laborious to change.
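To make the "owner, interface, users" idea concrete, here is a minimal sketch of how a data product could be described in code. Everything in it is an assumption for illustration: the `DataProduct` class, its field names and the example values are hypothetical and not Starship's actual schema or tooling.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor for one data product (illustrative only)."""

    name: str
    owner: str                  # person accountable for the product
    interface: str              # how the data is exposed, e.g. "bi_dashboard"
    users: tuple[str, ...]      # known consumers of the product
    quality_expectations: dict[str, str] = field(default_factory=dict)

    def missing_fields(self) -> list[str]:
        """Return the names of required attributes that are still empty."""
        missing = []
        if not self.owner:
            missing.append("owner")
        if not self.interface:
            missing.append("interface")
        if not self.users:
            missing.append("users")
        return missing


# A fully described product: owner, interface and users are all known.
site_volume = DataProduct(
    name="site_business_volume",
    owner="data_scientist_a",
    interface="bi_dashboard",
    users=("site_team",),
    quality_expectations={"freshness": "daily"},
)

# A draft product that still lacks an owner and known users.
draft = DataProduct(
    name="driving_data_pipeline",
    owner="",
    interface="data_lake_table",
    users=(),
)
```

Calling `draft.missing_fields()` lists what still has to be decided before the draft counts as a data product in the sense used here.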
Most importantly, understanding the users and the value each product creates for them makes it much easier to prioritize between ideas. This is critical in a startup context, where you need to move quickly and don’t have time to perfect everything.
Group your data products into domains that reflect the organizational structure of the company
Before we became aware of the data mesh model, we had been successfully using the format of lightly embedded data scientists at Starship for some time. In practice, some key teams had a data team member working with them part-time, whatever that meant within a specific team.
We proceeded to define data domains in line with our organizational structure, this time taking care to cover every part of the company. After mapping data products to domains, we assigned a data team member to curate each domain. This person is responsible for overseeing the entire set of data products in the domain – some owned by the same person, some by other engineers on the domain team, and some even by other data team members (e.g. for resourcing reasons).
There are several things we like about our domain setup. First and foremost, every area of the company now has a person overseeing its data architecture. Given the subtleties inherent in each domain, this is possible only because we have divided up the work.
Creating structure in our data products and interfaces has also helped us make better sense of our data world. For example, in a situation with more domains than data team members (currently 19 vs. 7), we are now doing a better job of making sure each of us works on an interrelated set of topics. And we now realize that, to alleviate growing pains, we should minimize the number of interfaces used across domain boundaries.
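One way to keep an eye on interfaces that cross domain boundaries is to record which product belongs to which domain and which products consume which. The sketch below is only an illustration under stated assumptions: the product names, domain names and dependency list are all made up, and nothing here reflects Starship's real catalogue.

```python
from collections import Counter

# Hypothetical mapping of data products to domains (illustrative names only).
product_domain = {
    "robot_telemetry": "robotics",
    "route_eta_model": "robotics",
    "order_events": "marketplace",
    "merchant_dashboard": "marketplace",
}

# Hypothetical dependencies as (consumer product, producer product) pairs.
dependencies = [
    ("route_eta_model", "robot_telemetry"),
    ("merchant_dashboard", "order_events"),
    ("merchant_dashboard", "route_eta_model"),
]


def cross_domain_interfaces(product_domain, dependencies):
    """Return dependencies whose consumer and producer are in different domains."""
    return [
        (consumer, producer)
        for consumer, producer in dependencies
        if product_domain[consumer] != product_domain[producer]
    ]


def interfaces_per_domain_pair(product_domain, dependencies):
    """Count cross-domain dependencies per (consumer domain, producer domain)."""
    return Counter(
        (product_domain[consumer], product_domain[producer])
        for consumer, producer in cross_domain_interfaces(product_domain, dependencies)
    )


crossing = cross_domain_interfaces(product_domain, dependencies)
pair_counts = interfaces_per_domain_pair(product_domain, dependencies)
```

In this toy data only one dependency crosses a boundary, so it is the one worth treating as a carefully designed interface.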
Finally, a more subtle bonus of using data domains: we now feel that we have a recipe for dealing with all sorts of new situations. Whenever a new initiative comes along, it is much clearer to everyone where it belongs and who should run with it.
There are also some open questions. While some domains naturally lean towards exposing source data and others towards consuming and transforming it, some do a fair amount of both. Should we split them up if they become too big? Or should we have subdomains within larger domains? We will need to make these decisions down the road.
Empower the people building your data products by standardizing without centralizing
The goal of the data platform at Starship is straightforward: make it possible for a single data person (usually a data scientist) to take care of a domain end-to-end, i.e. to keep the central data platform team out of the day-to-day work. This requires providing domain engineers and data scientists with good tooling and standard building blocks for their data products.
Does this mean that you need a full data platform team for the data mesh approach? Not really. Our data platform team consists of a single data platform engineer, who in parallel spends half of their time embedded in a domain. The main reason we can be so lean in data platform engineering is the choice of Spark + Databricks as the core of our data platform. Our earlier, more traditional data warehouse architecture placed a significant data engineering overhead on us due to the diversity of our data domains.
We have found it useful to make a clear distinction in the data stack between the components that are part of the platform and everything else. Here are some examples of what we provide to domain teams as part of our data platform:
- Databricks + Spark as a working environment and a versatile compute platform;
- One-liner functions for data ingestion, such as from Mongo collections or Kafka topics;
- An Airflow instance for scheduling data pipelines;
- Templates for building and deploying predictive models as microservices;
- Data product cost tracking;
- BI and visualization tools.
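As a rough illustration of what a "one-liner" ingestion building block could look like, the sketch below hides source-specific readers behind a single `ingest` function. This is an assumption about the pattern, not Starship's implementation: `ingest`, `register_reader` and both readers are hypothetical, and the readers are stubs that return placeholder records instead of talking to MongoDB or Kafka.

```python
from typing import Callable, Dict, Iterable, List

Record = dict
Reader = Callable[[str], Iterable[Record]]

# Registry of source-specific readers maintained by the platform (hypothetical).
_READERS: Dict[str, Reader] = {}


def register_reader(source_type: str) -> Callable[[Reader], Reader]:
    """Decorator that registers a reader function under a source type name."""
    def decorator(fn: Reader) -> Reader:
        _READERS[source_type] = fn
        return fn
    return decorator


@register_reader("mongo")
def _read_mongo(name: str) -> Iterable[Record]:
    # Stub standing in for a real MongoDB collection read.
    return [{"source": "mongo", "collection": name}]


@register_reader("kafka")
def _read_kafka(name: str) -> Iterable[Record]:
    # Stub standing in for a real Kafka topic consumer.
    return [{"source": "kafka", "topic": name}]


def ingest(source_type: str, name: str) -> List[Record]:
    """Single entry point a domain team would call to pull data in."""
    try:
        reader = _READERS[source_type]
    except KeyError:
        raise ValueError(f"unsupported source type: {source_type!r}") from None
    return list(reader(name))


# The call a data scientist in a domain would write:
orders = ingest("mongo", "orders")
```

The point of the design is that connection details and conventions live in the platform-owned readers, while domain teams only see the one-line call.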
As a general approach, our goal is to standardize as much as makes sense in our current context – even the bits we know will not stay standardized forever. As long as it helps productivity right now and doesn’t centralize any part of the process, we’re happy. And of course, some elements are currently missing from the platform altogether. For example, tooling for data quality assurance, data discovery and data lineage are things we have left for the future.
Strong personal ownership supported by feedback loops
Having fewer people and teams is actually an asset in some aspects of governance; for example, it is much easier to make decisions. On the other hand, our core governance question is also a direct result of our size. If every domain has a single data person, that person cannot be expected to be an expert in every technical aspect. However, they are the only person with a detailed understanding of the domain. How can we maximize the chances of them making good choices within their domain?
Our answer: through a culture of ownership, discussion, and feedback within the team. We have borrowed liberally from the management philosophy at Netflix and cultivated the following:
- Personal responsibility for the outcome (of one’s products and domains);
- Seeking different opinions before making decisions, especially those that affect other domains;
- Soliciting feedback and code reviews both as a quality mechanism and as an opportunity for personal growth.
We’ve also made a few specific agreements on how we approach quality, written down our best practices (including naming conventions), and so on.
These principles also apply outside our data team’s “building” work, which has been the focus of this blog post. Clearly, there is more to how our data scientists create value in the company than just providing data products.
One final thought about governance – we will keep iterating on the way we work. There will never be a single “best” way to do things, and we know we have to adapt over time.
The final word
This is it! These were the 4 core data mesh concepts as applied at Starship. As you can see, we have found an approach to the data mesh that suits us as a nimble growth-stage company. If it sounds appealing in your context, I hope reading about our experience has been helpful.
If you would like to join our work, see our career page for a list of open positions. Or check out our YouTube channel to learn more about our world-leading robotic delivery service.
If you have any questions or thoughts, reach out to me and let’s learn from each other!