ARKit + RGB Sampling

I’ve been working on an ARKit app to paint with “pixels” floating in space. When the ARSession delivers a new ARFrame to its delegate, I sample the colors the camera sees at the detected feature points, create a simple 3D box at each point with the sampled color, and can then pan around the resulting cloud of pixels. This is pretty neat when you scan someone’s face and then have them leave.

Anyway, that whole “sample the color of the camera’s captured image at an arbitrary 2D coordinate” turned out to be a dramatically more difficult problem than I had anticipated. Obstacles include:

  • Image is in CVPixelBuffer format.
  • The pixel buffer is in a bi-planar YCbCr format (the camera’s native output), not RGB.
  • Converting individual samples from YCbCr to RGB is non-trivial and involves a matrix multiplication.
  • There are several different conversion matrices out there for handling different color spaces, just in case you wanted to convert an image captured off a VHS tape, I guess?
  • Apple’s Accelerate framework can do this conversion on the entire image very quickly, but the setup is complex and consists of invoking a chain of C functions (see the sketch after this list). Once properly configured, it is spectacularly fast, converting an entire camera image in roughly half a millisecond.
  • The Accelerate framework has not received much love since Apple’s switch to the unified documentation style last year: hundreds of functions appear nowhere in the documentation. The only way to figure out that they exist and how to use them is to browse the Accelerate header files, which are robustly commented.
  • Swift’s type safety is a big pain in the butt when you’re dealing with unsafe data structures like image buffers.
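
To make the Accelerate piece concrete, here’s a rough sketch of the setup, assuming the bi-planar, full-range buffer that ARFrame.capturedImage delivers. The functions and constants are real Accelerate symbols, but treat the pixel-range values and the choice of the BT.601 matrix as assumptions to verify against your buffer’s attachments; the gist linked below has the complete version I actually use.

    import Accelerate
    import CoreVideo

    // One-time setup: build the YpCbCr -> ARGB conversion info that
    // vImage reuses for every frame.
    var conversionInfo = vImage_YpCbCrToARGB()

    func configureConversion() -> vImage_Error {
        // Full-range 8-bit levels, which is what the ARKit camera buffer uses.
        var pixelRange = vImage_YpCbCrPixelRange(Yp_bias: 0,
                                                 CbCr_bias: 128,
                                                 YpRangeMax: 255,
                                                 CbCrRangeMax: 255,
                                                 YpMax: 255,
                                                 YpMin: 0,
                                                 CbCrMax: 255,
                                                 CbCrMin: 0)
        // BT.601 is a common matrix for this buffer; check the buffer's
        // kCVImageBufferYCbCrMatrixKey attachment to be sure.
        return vImageConvert_YpCbCrToARGB_GenerateConversion(
            kvImage_YpCbCrToARGBMatrix_ITU_R_601_4,
            &pixelRange,
            &conversionInfo,
            kvImage420Yp8_CbCr8,
            kvImageARGB8888,
            vImage_Flags(kvImageNoFlags))
    }

    // Per frame: wrap the two planes of the CVPixelBuffer in vImage_Buffers
    // and convert the whole image into a preallocated ARGB destination.
    func convert(_ pixelBuffer: CVPixelBuffer,
                 into destination: inout vImage_Buffer) -> vImage_Error {
        CVPixelBufferLockBaseAddress(pixelBuffer, .readOnly)
        defer { CVPixelBufferUnlockBaseAddress(pixelBuffer, .readOnly) }

        // Plane 0 is luma (Y), plane 1 is interleaved chroma (CbCr).
        var luma = vImage_Buffer(
            data: CVPixelBufferGetBaseAddressOfPlane(pixelBuffer, 0),
            height: vImagePixelCount(CVPixelBufferGetHeightOfPlane(pixelBuffer, 0)),
            width: vImagePixelCount(CVPixelBufferGetWidthOfPlane(pixelBuffer, 0)),
            rowBytes: CVPixelBufferGetBytesPerRowOfPlane(pixelBuffer, 0))

        var chroma = vImage_Buffer(
            data: CVPixelBufferGetBaseAddressOfPlane(pixelBuffer, 1),
            height: vImagePixelCount(CVPixelBufferGetHeightOfPlane(pixelBuffer, 1)),
            width: vImagePixelCount(CVPixelBufferGetWidthOfPlane(pixelBuffer, 1)),
            rowBytes: CVPixelBufferGetBytesPerRowOfPlane(pixelBuffer, 1))

        return vImageConvert_420Yp8_CbCr8ToARGB8888(
            &luma, &chroma, &destination, &conversionInfo,
            nil,   // no channel permutation
            255,   // opaque alpha
            vImage_Flags(kvImageNoFlags))
    }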

Setting up ARKit to display the “pixels” took about 2 hours (my first ARKit experiment and my first exposure to SceneKit). Getting the color samples to color the pixels took about 2 days. I don’t feel this learning process is particularly valuable for the average ARKit developer to repeat, so I’ve tidied up the code and released it as a gist.

Check it out: CapturedImageSampler.swift

Usage: when your app receives a new ARFrame via the ARSession’s delegate callback, instantiate a new CapturedImageSampler with it. You are then free to query it for the color at a particular coordinate. I’m using normalized scalar coordinates (0.0–1.0) so that sampling is independent of the image’s resolution. If you want to find the color under a user’s tap, for instance, simply convert the x and y coordinates to scalars by dividing them by the screen width and screen height, respectively. When you’re done sampling (which must happen before the next frame arrives), simply discard the CapturedImageSampler by letting it go out of scope. Do not retain the sampler, use it asynchronously, or pass it between threads; it should not live longer than the ARFrame that created it.
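
Here’s a sketch of what that looks like in a session delegate. The initializer and getColor(atX:y:) names are illustrative, so check the gist for the exact API; the shape of the usage is the point.

    import ARKit

    final class PixelPainter: NSObject, ARSessionDelegate {

        func session(_ session: ARSession, didUpdate frame: ARFrame) {
            // Assuming a failable initializer that wraps frame.capturedImage.
            guard let sampler = CapturedImageSampler(frame: frame) else { return }

            // Normalized coordinates: (0.5, 0.5) is the center of the captured image.
            if let color = sampler.getColor(atX: 0.5, y: 0.5) {
                // Use the color, e.g. as the diffuse material of a SceneKit box.
                print(color)
            }

            // For a tap, convert to scalars first:
            // let x = tapPoint.x / UIScreen.main.bounds.width
            // let y = tapPoint.y / UIScreen.main.bounds.height

            // The sampler goes out of scope here; never retain it past this frame.
        }
    }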

A word of warning: this object is not at all thread-safe, due to its private use of a shared static buffer. I chose this implementation for maximum performance, since a new buffer does not need to be allocated for every frame received from the ARSession. However, if you get into a situation where two instances of CapturedImageSampler are simultaneously attempting to access the shared buffer, you will have a very bad day. If you need a thread-safe version of this, I suggest you make the rawRGBBuffer property non-static and add a “release” method that frees the buffer’s memory when you’re done with it. Failure to manage this process correctly will result in a catastrophic memory leak that will get your app terminated within a couple of seconds.
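
If you do go that route, the shape would be something like this hypothetical sketch (this is not what the gist ships; the type, property, and method names are purely illustrative):

    import Foundation

    // Hypothetical thread-safe variant: each instance owns its buffer
    // and must be released explicitly when you're done sampling.
    final class OwnedBufferSampler {
        private var rawRGBBuffer: UnsafeMutableRawPointer?

        init(byteCount: Int) {
            rawRGBBuffer = malloc(byteCount)
        }

        /// Call exactly once per instance; skipping it leaks an entire RGB
        /// buffer every frame, which gets the app jettisoned very quickly.
        func release() {
            free(rawRGBBuffer)
            rawRGBBuffer = nil
        }

        deinit {
            // Safety net in case release() was never called; free(nil) is a no-op.
            free(rawRGBBuffer)
        }
    }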

Quick note on CoreML performance…

I just finished doing some benchmarking with the Oxford102 model, identifying types of flowers on an iPhone 7 Plus from work. The Oxford102 is a moderately large model, weighing in at around 229MB. As soon as the lone view in the app is instantiated, I load an instance of the model into memory, which seems to allocate about 50MB.

The very first time the model is queried after a cold app launch, there is a high degree of latency. Across several runs I saw an average of around 900ms for the synchronous call to the model to return. On subsequent uses, however, performance improves dramatically, with an average response time of around 35ms. That’s good enough to provide near-real-time analysis of video, even after you factor in the overhead of scaling the source image to the model’s required input size (in this case, 227×227). Even if you only updated the results every 3-4 frames, it would still feel nearly instantaneous to the user.
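
For reference, those numbers came from simple wall-clock timing around the synchronous prediction call, along the lines of the sketch below. The Oxford102 class is the Xcode-generated model interface, and the data input name is a placeholder; use whatever your generated interface actually exposes.

    import CoreML
    import CoreVideo
    import Foundation

    // Rough wall-clock timing of a single synchronous prediction.
    func timePrediction(model: Oxford102, pixelBuffer: CVPixelBuffer) {
        let start = CFAbsoluteTimeGetCurrent()
        _ = try? model.prediction(data: pixelBuffer)   // placeholder input name
        let elapsedMs = (CFAbsoluteTimeGetCurrent() - start) * 1000
        let formatted = String(format: "%.1f", elapsedMs)
        print("Prediction took \(formatted) ms")
    }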

From a practical standpoint, it would probably be a good idea to exercise the model once in the background before using it in a user-noticeable way. This will prevent the slow “first run” from being noticed.
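
Something along these lines, kicked off at launch, does the trick. Again, Oxford102 and its data input are placeholders for whatever the Xcode-generated interface exposes.

    import CoreML
    import CoreVideo
    import Foundation

    /// Runs one throwaway prediction on a background queue so the expensive
    /// first-run setup happens before the user ever asks for a real result.
    func warmUp(_ model: Oxford102) {
        DispatchQueue.global(qos: .utility).async {
            guard let dummy = makeBlankPixelBuffer(width: 227, height: 227) else { return }
            _ = try? model.prediction(data: dummy)   // placeholder input name
        }
    }

    /// Creates an empty BGRA pixel buffer matching the model's input size.
    func makeBlankPixelBuffer(width: Int, height: Int) -> CVPixelBuffer? {
        var buffer: CVPixelBuffer?
        let status = CVPixelBufferCreate(kCFAllocatorDefault, width, height,
                                         kCVPixelFormatType_32BGRA, nil, &buffer)
        return status == kCVReturnSuccess ? buffer : nil
    }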