Building an in-house IoT observability system
Stay ahead of the curve with trusted IoT expertise
The #1 Thing to Consider When Building an In-House IoT Observability System

Engineers love to build things and solve problems—so it’s no surprise that many embedded teams decide to build their own in-house IoT observability system. Here are four critical factors to consider when you decide to build vs buy.

As Memfault’s VP of Engineering, I love talking with embedded teams about the in-house IoT observability systems their teams have cooked up. Are they doing over-the-air (OTA) updates out of an S3 bucket? Uploading crashes to Bugsnag? Sending metrics to Datadog? How are they comparing the health of the new product line against the last one?

Building an in-house IoT observability system requires solving impressive engineering challenges that span from bare metal to the cloud, and it’s a pleasure to discuss how their creative, bespoke, and sometimes beloved solutions compare against Memfault.

But the most interesting piece for me is to see if these teams have come to a key realization: no matter how much effort they’ve invested, building in-house observability means there is always a Sword of Damocles precariously balanced over the success of the product.

[Image: Sword of Damocles]

That’s because the more successful your product becomes, the more your corresponding observability system must be rapidly re-engineered, expanded, and re-thought to meet complex new requirements. In short: the success of your product is now bound to how well your in-house system—and the team building it—can keep up.

How This Realization Comes to Life

We often see this “point of realization” come to life with embedded teams that are very far along on the maturity journey of their tools. To illustrate, I’ll use the example of a successful financial services team Memfault works with.

The team was very mature and already had the following in place:

  • An OTA update system with support for various release channels and gradual rollouts
  • Proactive tracking of all types of metrics and logs, which they dumped (at high cost) into various systems
  • A full-time, cross-disciplinary team with a roadmap working hard to keep up with the business needs across the organization

Things Work Okay…Until They Don’t.

What does it feel like when the tools can’t keep up and the Sword of Damocles starts to fall?

The on-call system is exploding.

They have more customers than they can possibly speak to who are discovering the product is broken in ways they could have never imagined. These issues are creating not just “bugs,” but severe business “incidents.” These are not the sort of problems that are just discussed at a standup—they’re the kind that go into a shareholder report.

Engineers are struggling to keep up.

Embedded engineers are frantically brushing up their SQL to join all manner of different tables and figure out what’s in the data lake. They see some bad trends and pick a few device identifiers. They jump into different systems and pull different telemetry reports for those devices. They’re slowly building up a mental model of what happened on those devices over time and what could be behind the failures.

Tracking issues is nearly impossible.

But what about all the other reports from those devices? When did the issue start? How can they tie this data back into their release system to know how many devices it may impact? How can they check the reports not only from these devices, but from every other device in the fleet?
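
To make the impact question concrete, here is a minimal Python sketch of tying crash reports back to release versions. Everything in it is hypothetical: the device identifiers, firmware versions, issue signatures, and the `impact_by_version` helper are invented for illustration and do not describe any particular team's system or Memfault's API.

```python
from collections import defaultdict

# Hypothetical fleet inventory: device_id -> firmware version currently installed.
fleet = {
    "dev-001": "1.2.0",
    "dev-002": "1.2.0",
    "dev-003": "1.3.0",
    "dev-004": "1.3.0",
    "dev-005": "1.3.0",
}

# Hypothetical crash reports: (device_id, issue_signature).
crash_reports = [
    ("dev-003", "hardfault_in_ble_stack"),
    ("dev-004", "hardfault_in_ble_stack"),
    ("dev-003", "hardfault_in_ble_stack"),  # repeat report from the same device
    ("dev-001", "watchdog_timeout"),
]


def impact_by_version(fleet, crash_reports, issue):
    """Return {version: (devices_reporting_issue, devices_on_version)}."""
    # Group every device in the fleet by the version it runs.
    on_version = defaultdict(set)
    for device_id, version in fleet.items():
        on_version[version].add(device_id)

    # Collect the distinct devices that reported the given issue, per version.
    affected = defaultdict(set)
    for device_id, signature in crash_reports:
        if signature == issue and device_id in fleet:
            affected[fleet[device_id]].add(device_id)

    return {v: (len(affected[v]), len(devs)) for v, devs in on_version.items()}


for version, (hit, total) in impact_by_version(
    fleet, crash_reports, "hardfault_in_ble_stack"
).items():
    print(f"{version}: {hit} of {total} devices reported the issue")
```

Even this toy version needs a device inventory, a report store, and a shared notion of "the same issue" to all agree with each other, which is exactly where separate in-house systems tend to struggle.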

Dependencies are built.

What happens to the knowledge built up during this incident? Usually it is siloed within one or a few engineers. What happens when they leave or are on holiday? Other engineers have to slowly re-discover the same information again.

How well do your embedded engineers understand the database storing that data? How well do they understand the schemas necessary to make the queries? What happens to the bespoke SQL queries that are written?

The team is drowning in data.

At this stage, the team is realizing that having the data itself is not enough. In fact, that’s the easy part. Now they’re drowning in data from the massive number of devices they have in the field. They’ve realized these distinct systems actually need to be part of one composite, coherent system because at this scale, every second of customer support and engineering time counts.

Build vs Buy: What to Consider

Building in-house tools can be incredibly risky. Even at a very mature, enterprise stage, a huge sunk cost in internal tools isn’t a predictor of future success. It just means the stakes are higher.

Consider these factors when deciding to build:

1: Total cost of ownership

Building systems internally has significant costs in time, headcount, data storage, and data processing. Data housing and computing are a small part of the cost. Consider ongoing support costs for new and upgraded SDKs, chipsets, connectivity pathways, and metrics.

VDC’s 2024 State of IoT Software Development report found that engineers using third-party tools to collect device performance and health data saved 57% in overall project development costs versus those using in-house solutions. Many of our customers find that building and maintaining an internal system would easily exceed Memfault’s cost by 400% or more.

Related: ROI of Embedded Observability

2: Engineering time and skill sets

The financial services team in the example above valued one engineering year at approximately $500K. They knew they’d need multiple highly skilled, specialized engineers to build a reliable web app with a database and an S3 store, and to bring up the infrastructure necessary to run the service. They would have needed to hire engineers or borrow them from other teams to get the job done.

That takes a lot of engineering time away from product development. Adopting an existing solution, by contrast, may need only an individual firmware engineer to help with integration and configuration.
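
As a rough illustration of how quickly that adds up, the sketch below multiplies out the engineering cost alone. The $500K figure is the team's estimate from above; the headcount and duration are hypothetical placeholders, and the `build_cost` helper is invented for this example.

```python
# Value of one engineering year, as estimated by the team in the example above.
COST_PER_ENGINEER_YEAR = 500_000


def build_cost(engineers, years, cost_per_engineer_year=COST_PER_ENGINEER_YEAR):
    """Engineering cost only; excludes data storage, compute, and support."""
    return engineers * years * cost_per_engineer_year


# Hypothetical staffing: three engineers for two years.
print(f"${build_cost(3, 2):,}")
```

Data storage, compute, and ongoing support would come on top of this figure.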

3: Scalability

Be sure to evaluate production-grade deployments, because small-scale projects may hide true costs. Scalability and efficiency will affect infrastructure and administrative costs as well.

4: Time to market

You’re likely to achieve value much faster with an out-of-the-box solution compared to the time required to build an in-house system. And getting a system in place sooner rather than later can actually speed up your production process. In fact, our IoT development report found that organizations using remote device health and performance monitoring solutions were 3x as likely to finish ahead of schedule as those not collecting data.

A lot of our customers told us they had an in-house observability system sitting in their backlog for months or years. Why wait to start taking advantage of the value?

What to Consider When Buying

Buying is also, clearly, not without risk. These systems are critical; is it really the right approach to solve that through outsourcing? The answer, of course, depends on the credibility, reliability, and ultimately the value of the solution that you’re looking at relative to your needs.

One of Memfault’s customers, Bond Home, had already built an in-house solution when they decided to switch. They were sending their crash data to AWS via MQTT and then using Sentry to visualize the information. But trying to scale their internal system was consuming significant resources and not keeping pace with development.

“Maintaining our own debugging infrastructure requires constant maintenance and updates as we evolve our platform. On the other hand, Memfault just keeps investing and improving the platform while we are using it. Doing that ourselves would cost immense resources we are just not able to accommodate.”

Chris Merck | VP of Engineering | Bond Home

The best path is to find the right solution: one that lets you avoid the cost, time, and risk of building and maintaining an in-house system from the beginning, and that has already solved the hard scalability challenges so that you can put your full focus on making your product successful.

For more information, check out our whitepaper, Build, Buy, or Blend: Optimizing IoT Project Performance in the Hyperscale Era. It explores key considerations when selecting tools and systems for your IoT development stack, along with common challenges organizations face when building in-house solutions for monitoring and updating device fleets.

See How We Can Help

If you’re interested in learning more, please get in touch! Feel free to message me on LinkedIn with questions about IoT observability systems. After all, for me and many others here at Memfault, speaking to teams about these challenges is one of the best parts of the job.

You can also schedule a custom demo with our team or jump into our interactive sandbox to see how Memfault works.
