Looking at how Tesla and Comma curate driving datasets

February 22, 2026

It’s remarkably difficult to wrap your head around the sheer scale of modern data systems. Of course, given access to so much data, the question quickly becomes how you can glean insights from it. I was recently watching Ashok’s talk at Scaled ML 2026 and Mitchell Goff’s talk at COMMA CON 2025, in which they present how these two very different companies (despite operating in the same domain) ingest driving data and determine which segments would best serve in training their models.

Comma pulls “1000+ hours of data / day”, although it’s unclear whether this is “out of the 10,000 hours that people are driving with openpilot”, which would imply that these segments only reflect when openpilot is active [4:15-4:26]. Regardless, a significant degree of selectivity is applied: taking those figures together, only about 10% of driven hours are pulled in. Mitchell describes the process:

  1. the comma device records segments (video + sensor data) in one minute chunks
  2. the device uploads a heavily quantized/compressed version of this data (a qlog) to the cloud
  3. some algorithm discerns whether the whole segment should be fetched based on the qlog
  4. data is fetched, if applicable

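The decision in step 3 is the interesting part. As a rough sketch of what such a gate might look like, here is a hypothetical filter over summary statistics extracted from a qlog. The field names and thresholds are invented for illustration; real qlogs are compressed streams of openpilot messages, not this flat structure.

```python
# Hypothetical sketch of step 3: decide from a compressed qlog summary
# whether the full segment (video + raw sensors) is worth fetching.
# All fields and thresholds are illustrative, not Comma's actual logic.
from dataclasses import dataclass

@dataclass
class QlogSummary:
    max_decel_ms2: float       # strongest deceleration seen in the segment
    disengagements: int        # times the driver took over from openpilot
    min_lead_distance_m: float # closest approach to a lead vehicle

def should_fetch(q: QlogSummary,
                 decel_thresh: float = 4.0,
                 lead_thresh: float = 5.0) -> bool:
    """Fetch the full segment only if the compressed log hints at
    something interesting; most mundane segments are skipped."""
    return (q.max_decel_ms2 >= decel_thresh
            or q.disengagements > 0
            or q.min_lead_distance_m < lead_thresh)

# Mundane commute segment: skipped
print(should_fetch(QlogSummary(1.2, 0, 30.0)))  # False
# Hard-braking segment: fetched
print(should_fetch(QlogSummary(5.5, 0, 12.0)))  # True
```

The key design property is that the expensive payload never crosses the network unless the cheap summary justifies it.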
On the other hand, Tesla operates at a vastly different scale with the ability to capture “500 years of driving data every single day” [10:04]. Ashok doesn’t go into technical detail about how they distill the 500 years of data into some smaller set, but both speakers discuss the principles used to detect “interesting” segments (although Mitchell’s talk focuses more on later stages, like the train/test split). Tesla, for one, has been very vocal about autonomous driving’s “long tail” (Ashok has tweeted, “The long tail is sooo long, that most people can’t grasp it.”). In practice, this means that creating safe autonomous driving systems requires real-world data, and this data must be varied enough to capture extreme scenarios beyond, say, a daily work commute.

In another talk at COMMA CON 2025, Harald Schaefer shows how Comma’s infra supports the ability to parse through downloaded segments to find interesting cases, specifically events where the car brakes hard [20:50]. Although the code isn’t explicit, I imagine that this is an example of using sensor data (check out examples of comma’s rlogs) to find anomalies, as opposed to using video. It’s important to note that this happens after the data has already been deemed “useful” (with respect to the algorithm above).

Ashok mentions other methods to discern interesting driving data: tiny NNs for semantic object detection, user intervention during autonomous operation, and any large changes in state space (i.e., hard brakes). The detail here is unfortunately obscured - are they also looking at a distilled version of the full record? Is detection staged into tiers to reduce transmission over the network? Nonetheless, I think comparing the principles of the two approaches outlines how the dataset curation flow might look. It is not feasible to collect everything and, better yet, doing so would likely overfit the model to mundane data. Comma’s datacenter is pretty lean for the magnitude that they’re operating at, so as Ursula says in D.H. Lawrence’s Women in Love, “One must discriminate” – data collection is no exception.
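One way to read the “large changes in state space” trigger is geometrically: flag a segment whenever the vehicle’s kinematic state jumps sharply between consecutive samples. The state vector and threshold below are my own illustration, not anything Tesla has described.

```python
# Illustrative "state-space jump" trigger: a state is (speed m/s,
# yaw rate rad/s, longitudinal accel m/s^2); flag the segment if any
# consecutive pair of states is far apart. Threshold is arbitrary.
import math

def state_delta(s1, s2):
    """Euclidean distance between two kinematic state vectors."""
    return math.dist(s1, s2)

def flag_segment(states, jump_thresh=3.0):
    """Flag a segment if any consecutive state pair jumps past thresh."""
    return any(state_delta(a, b) > jump_thresh
               for a, b in zip(states, states[1:]))

smooth = [(20.0, 0.0, 0.1), (20.1, 0.0, 0.1), (20.2, 0.01, 0.0)]
swerve = [(20.0, 0.0, 0.1), (14.0, 0.5, -5.0)]  # sudden brake + yaw
print(flag_segment(smooth), flag_segment(swerve))  # False True
```

A trigger like this is attractive precisely because it is model-free: it needs no labels or NN inference, just the kinematics the car already logs.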