Ryan Jones

Object Detection Training with Apple’s Turi Create for CoreML (2019 Update)

Taking a look at my last post about CoreML object detection, I decided to update the two-part series for the latest Turi Create (now using Python 3.6). The original parts covered detecting an object in the camera frame (photo or video) and drawing a bounding box around it. This post builds on those two parts by adding detection for multiple objects. With iOS 12, the Vision framework makes it easier to work with detected objects in Swift, and with GPU support, training completes orders of magnitude faster!

Table of contents

Installation

A lot of the setup is the same as before, but the runtime needs some updates. The latest version of Python that works with Turi Create 5.4 is 3.6.8 as of this writing. Download and install it, then follow the instructions in the GitHub repo to set up the virtual environment and install Turi Create via pip.

Image Setup

This still uses the Simple Image Annotator from the previous post, which generates a CSV file of all of the image annotations. I recommend keeping all the images in one folder and outputting to a single CSV; otherwise you will have to combine multiple CSV files into one. This step still uses Python 2, which is installed by default on macOS and can be used outside of the virtual environment created above.
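
As a reference, the annotator's CSV rows look roughly like the following sketch. The image names and coordinates here are made up for illustration; the column order (image, id, label, xMin, xMax, yMin, yMax) is what the prep script below assumes.

```python
# Hypothetical rows matching Simple Image Annotator's CSV output; the column
# order (image, id, label, xMin, xMax, yMin, yMax) is what prep.py expects.
import csv
import io

sample = """img_001.jpg,1,object,34,210,18,160
img_002.jpg,2,other_object,5,90,40,130
"""

for image, _id, label, x_min, x_max, y_min, y_max in csv.reader(io.StringIO(sample)):
    print(image, label, x_min, y_min, x_max, y_max)
```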

To keep the update from the previous articles as simple as possible, source imagery for object detection should be kept in the same folder format.

training/
├── images/
│   ├── object/ <- named what you’re detecting
│   └── other_object/ <- what else you’re trying to detect
├── prep.py
└── train.py

This is slightly different from the previous iteration, which involved two steps:

  • create the annotations column (convert.py previously)
  • prepare the files for training using Turi

I simplified this into one step with prep.py, since the annotations column requires the label (the object and other_object training labels above). Combining the steps made more sense and means a training label name no longer has to be hardcoded. The script now uses the subdirectory names under images/ (object and other_object above) and can prepare any number of objects for detection.

python prep.py input_file.csv

prep.py now uses Python 3 and takes the location of the CSV file you output from Simple Image Annotator.

# prep.py
import math
import sys

import pandas as pd
import turicreate as tc
from turicreate import SArray

if len(sys.argv) < 2:
    sys.exit("Requires an input CSV file")

fileIn = sys.argv[1]

pathToImages = 'images'

def findLabel(path):
    # pull the label from the image path, i.e. images/_label_/image.jpg
    return path.split('/')[1]

def annotation(row, label):
    # build a Turi Create annotation: a rectangle centered on the bounding box
    height = row['yMax'] - row['yMin']
    width = row['xMax'] - row['xMin']
    x = row['xMin'] + math.floor(width / 2)
    y = row['yMin'] + math.floor(height / 2)

    props = {'label': label, 'type': 'rectangle'}
    props['coordinates'] = {'height': height, 'width': width, 'x': x, 'y': y}
    return [props]

# Load images
data = tc.image_analysis.load_images(pathToImages, with_path=True)

csv = pd.read_csv(fileIn, names=["image", "id", "label", "xMin", "xMax", "yMin", "yMax"])
# From the path name, create a label column
data['label'] = list(map(findLabel, data['path']))

# the CSV rows are in no particular order, so loop to match each image
annotations = []
count = 0
prev = 0
for j, item in enumerate(data):
    prev = count
    imageName = str(item['path'].split('/')[2])
    for i, row in csv.iterrows():
        if str(row['image']) == imageName:
            # matched the image name in the path
            annotations.append(annotation(row, item['label']))
            count += 1
            break
    if prev == count:
        # Figure out which item did not match
        print(item)

# make an array from the annotations data, matching the data order
data['annotations'] = SArray(data=annotations, dtype=list)

# Save the data for future use
data.save('training.sframe')

data['image_with_ground_truth'] = tc.object_detector.util.draw_bounding_boxes(data["image"], data["annotations"])

# Explore interactively
data.explore()

The matching loop is where the magic happens: it creates the data Turi Create needs to train the model. If all went according to plan, and the number of rows in the CSV matches the number of images, the data object will have an annotations column holding an array of the annotated objects. At the end of the prep.py run, you should also have a training.sframe directory with everything needed to train your model.
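
For reference, each entry in the annotations column follows Turi Create's rectangle format, with x and y at the box center rather than a corner. A minimal sketch of the corner-to-center conversion (the coordinates are made up for illustration):

```python
# Sketch of Turi Create's per-image annotation format: a list of dicts,
# each a rectangle centered at (x, y). Coordinates here are hypothetical.
def annotation_from_corners(label, x_min, x_max, y_min, y_max):
    width = x_max - x_min
    height = y_max - y_min
    return [{
        'label': label,
        'type': 'rectangle',
        'coordinates': {
            'x': x_min + width // 2,   # box center, not the top-left corner
            'y': y_min + height // 2,
            'width': width,
            'height': height,
        },
    }]

print(annotation_from_corners('object', 10, 110, 20, 80))
# → [{'label': 'object', 'type': 'rectangle',
#     'coordinates': {'x': 60, 'y': 50, 'width': 100, 'height': 60}}]
```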

Training the Model

Little changed for the train.py script, except that Turi Create 5 added GPU support. Training 1000 iterations previously took nearly 3 hours; in my test with even more imagery, it took around 17 minutes. Nearly 10 times faster to train! So I bumped the number of iterations up to 2500, which takes about 42 minutes on my Radeon Pro 560 with 4 GB.

All that is needed is to set the modelName variable to whatever you want to use in Xcode.

python train.py
Setting 'batch_size' to 32
Using GPU to create model (AMD Radeon Pro 560)
+--------------+--------------+--------------+
| Iteration    | Loss         | Elapsed Time |
+--------------+--------------+--------------+
| 1            | 6.255        | 12.2         |
| 11           | 6.269        | 22.4         |
…
| 2500         | 0.734        | 2556.8       |
+--------------+--------------+--------------+

At the end of the train script, you should have a modelName.model folder and a modelNameClassifier.mlmodel file to drag and drop into your Xcode project.

.mlmodel to Xcode

Upgrading to Xcode 10 and iOS 12, the Vision APIs are more user-friendly. We no longer have to do much of the detection math we did previously: the Vision API now returns an array of VNRecognizedObjectObservation objects, each with a boundingBox and a matching label of what was found, which makes detection much simpler. The updates to the Xcode project are based on the sample project from Apple. The sample project does not walk through detection in a single image, but my GitHub repo for this use case does.

Some setup is needed to use the camera for Vision detection. The updated code does this in a more compartmentalized fashion.

  • setUpAVCapture() sets up the AVCaptureSession to use a camera (the back camera by default) with Apple's recommended methods, and adds the camera layer to the screen
  • setUpLayers() adds a CALayer for the camera view and for drawing rectangles around the found objects over the camera layer
  • updateLayerGeometry() comes from the Apple project and keeps the overlay rectangles positioned correctly when the device rotates
  • setUpVision() sets up the machine-learned object detection using the Vision framework and provides a handler that is called when objects are detected

The main change with detection is how the detected objects are now a VNRecognizedObjectObservation object. This makes it easy to get the relevant information on screen.

for observation in results {
    guard let objectObservation = observation as? VNRecognizedObjectObservation else {
        continue
    }
    // labels are sorted by confidence, so take the highest-confidence label
    let topLabelObservation = objectObservation.labels[0]

    // convert the normalized bounding box into buffer coordinates
    let objectBounds = VNImageRectForNormalizedRect(objectObservation.boundingBox, Int(bufferSize.width), Int(bufferSize.height))

    let shapeLayer = layerWithBounds(objectBounds, identifier: topLabelObservation.identifier, confidence: objectObservation.confidence)
    detectionOverlay.addSublayer(shapeLayer)
}

We start off by checking that the results we get back from a detection are of the new type. The API has not changed; it still returns an array of Any, presumably for backwards compatibility with the array of VNCoreMLFeatureValueObservation it returned previously. So we need to be sure we are getting the iOS 12-only VNRecognizedObjectObservation objects from the results.

  • objectObservation.labels[0] takes the first label for the detected object, which is the one with the highest confidence
  • VNImageRectForNormalizedRect converts the detected bounding box from the normalized rectangle to a position on top of your camera view, a very important translation
  • layerWithBounds(_:identifier:confidence:) is a custom method that creates the layer denoting where the detected object's bounding box is

Detection using Vision for two trained objects

The custom layer-creation method does more than just put a rectangle around the detected area. Based on the confidence that one of the objects was detected, dashed lines are drawn when the confidence falls under my threshold of 45%. Colors also change per object, so red marks one object and cyan the other.

The advantage of this new Vision API is that we no longer have to compute the predictions ourselves, removing the Turi Create post-processing step (or the call to predictionsFromMultiDimensionalArrays) that we previously needed:

Check the IoU (see Evaluation) between it and and all the remaining predictions. Remove (or suppress) any prediction with an IoU above a pre-determined threshold (the nmsThreshold we extracted from the meta data).
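
For reference, the IoU check described above, which Vision now handles for us, can be sketched as follows. The box format (corner coordinates) and the example boxes are assumptions for illustration.

```python
# Sketch of the intersection-over-union (IoU) computation used by
# non-maximum suppression. Boxes are assumed to be (xMin, yMin, xMax, yMax).
def iou(a, b):
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # identical boxes -> 1.0
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # disjoint boxes  -> 0.0
```

During suppression, any prediction whose IoU with a higher-confidence prediction exceeds the threshold is discarded.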

Conclusion

Turi Create has made some great advancements since its release in late 2017. Along with Apple's updates to the Vision framework, training and detecting objects in images and live video is easier and faster than ever. I am excited to see what iOS 13 brings for CoreML and machine learning, and whether object detection training could be done directly through Xcode 11.