Ringface: AI Face recognition

In the previous blog post, we looked at how to access the Ring device and how to download video recordings from it when Ring events occur. In today's post, we will look at how to process these video files, extract the faces, and collect them into a library of known faces. The classifier built on top of this library is the mathematical linchpin of the solution. As always, you can jump straight to the finished application on GitHub.

In our architecture, the component responsible for these tasks is the classifier.

First of all, credit where credit is due: the recognition module is a wrapper around the dlib open source machine learning library developed by Davis King, and around the face_recognition Python module from Adam Geitgey, which bundles the dlib routines with pre-trained AI model binaries. As these gentlemen did a great job documenting and explaining the underlying modules, I will focus here on how we use them for our purposes.

Face recognition in a single image

While the following simplified single-image processing will not be used directly in the final video-processing solution, it is worth starting with the basic building blocks.

def recognition(personImageFile, dirStructure = DEFAULT_DIR_STUCTURE, clf = None, fitClassifierData = None):

    result = ImageRecognitionResult(personImageFile)

    if clf is None:
        clf, fitClassifierData = clfStorage.loadLatestClassifier(dirStructure.classifierDir)

    image = face_recognition.load_image_file(personImageFile)

    # Find all the faces in the test image using the default HOG-based model
    face_locations = face_recognition.face_locations(image)
    no = len(face_locations)
    logging.debug(f"Number of faces detected: {no}")
    encodings = face_recognition.face_encodings(image, face_locations)

    # Predict all the faces in the test image using the trained classifier
    for i in range(no):
        encoding = encodings[i]
        name = clf.predict([encoding])

        knownFaceEncodings = findKnownFaceEncodings(name, fitClassifierData)
        if commons.isWithinToleranceToEncodings(encoding, knownFaceEncodings): 
            logging.info(f"Recognised: {name}")
            result.addPerson(name[0])
            
    return result

The call to clfStorage.loadLatestClassifier reads the persisted classifier into a sklearn.svm.LinearSVC (linear Support Vector Classifier) instance; I will explain later what this classifier is and how we constructed it. face_recognition.face_locations then triggers the detection of the face regions in the image: keep in mind that there may be more than one person in the image. Here the underlying HOG algorithm compiles a Histogram of Oriented Gradients to assess whether a rectangular region of the image contains a face. The output of this step is a set of pixel coordinates, which can be cropped into face thumbnails.

We could visualise the output of this step as finding the face thumbnail within the original image.
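
To make the cropping step concrete, here is a hedged, self-contained snippet: the image path reuses one of the sample files shown later in this post, and the output file name is just an example.

import face_recognition
import PIL.Image

# Illustration only: crop the first detected face region into a thumbnail.
image = face_recognition.load_image_file("./sample-data/images/barack/new-images/barack1.jpeg")

# face_locations returns (top, right, bottom, left) tuples in pixel coordinates
face_locations = face_recognition.face_locations(image)

if face_locations:
    top, right, bottom, left = face_locations[0]
    thumbnail = PIL.Image.fromarray(image[top:bottom, left:right])
    thumbnail.save("face-thumbnail.jpeg")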

The next step is to turn this thumbnail into something that can be classified (tagged with a name). The face_recognition.face_encodings call does exactly that: for each detected face it outputs an encoding vector of 128 real numbers.

[
          -0.09314524382352829, 0.11986921727657318, 0.07301906496286392,
          -0.053206998854875565, -0.0017003938555717468,
          -0.00014687329530715942, -0.11579475551843643, -0.06327946484088898,
          0.18962357938289642, -0.14366395771503448, 0.27862706780433655,
          0.07541991025209427, -0.2088695466518402, -0.1254536211490631,
          0.06680457293987274, 0.13503627479076385, -0.23173490166664124,
          -0.07490985095500946, -0.0925355851650238, -0.05511965602636337,
          0.0026444904506206512, 0.01187570858746767, 0.06607645750045776,
          0.038094669580459595, -0.09664979577064514, -0.38529786467552185,
          -0.0816865861415863, -0.14535865187644958, 0.013874398544430733,
          -0.1952625960111618, -0.09412668645381927, -0.012078091502189636,
          -0.16904473304748535, -0.099558986723423, -0.006558932363986969,
          0.01563914492726326, -0.00731620192527771, -0.005515456199645996,
          0.18617157638072968, 0.047550201416015625, -0.13289299607276917,
          0.06521280109882355, 0.016453102231025696, 0.23716051876544952,
          0.24895066022872925, 0.10694848746061325, 0.0010272432118654251,
          -0.056838780641555786, 0.13377857208251953, -0.21948237717151642,
          0.0685833990573883, 0.16816368699073792, 0.10303550958633423,
          0.0482356920838356, 0.12589949369430542, -0.19619816541671753,
          -0.024759799242019653, 0.0691722109913826, -0.0838712751865387,
          0.06163743510842323, 0.04663514345884323, -0.07281285524368286,
          0.020316604524850845, 0.04898398742079735, 0.17329366505146027,
          0.03426067903637886, -0.10656487941741943, -0.051974933594465256,
          0.10832326114177704, -0.028278352692723274, -0.012372013181447983,
          0.014338754117488861, -0.17965441942214966, -0.22849386930465698,
          -0.24183648824691772, 0.06967386603355408, 0.3295048177242279,
          0.1943269819021225, -0.21061471104621887, 0.018246904015541077,
          -0.1473742127418518, -0.03225572407245636, 0.05533575266599655,
          0.07602326571941376, -0.021285507827997208, -0.07160190492868423,
          -0.0783371850848198, 0.055873408913612366, 0.10203529894351959,
          0.03376571834087372, 0.003750983625650406, 0.20189157128334045,
          -0.03460335358977318, 0.06303180009126663, -0.0007806364446878433,
          0.016734786331653595, -0.160550057888031, -0.028142400085926056,
          -0.17982521653175354, -0.07206931710243225, -0.010999158024787903,
          -0.007000971585512161, -0.018162399530410767, 0.11174336075782776,
          -0.18205730617046356, 0.090810626745224, 0.02806062251329422,
          -0.044966183602809906, -0.02320147305727005, 0.09496608376502991,
          -0.06461364030838013, -0.05141502618789673, 0.07340114563703537,
          -0.21723119914531708, 0.22918827831745148, 0.26179876923561096,
          0.029728425666689873, 0.16665227711200714, 0.08108121901750565,
          0.05614861845970154, -0.004576832056045532, -0.020393550395965576,
          -0.13926322758197784, -0.09659240394830704, 0.015614379197359085,
          0.07863864302635193, 0.07327841222286224, -0.008407235145568848
        ]

What do these 128 numbers precisely measure on the face? The specifics aren't our primary concern. What truly matters is that the model produces almost identical numbers when analysing two distinct images of the same individual. It is this model that is considered the AI part of the solution; it is implemented in dlib's face_recognition_model_v1.
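
To make this property concrete, here is a small hedged check, reusing the sample image paths that appear later in this post; the printed distances are typical values, not guarantees.

import numpy as np
import face_recognition

# Two encodings of the same person should lie close together in the
# 128-dimensional space, encodings of different people should not.
def encode(path):
    image = face_recognition.load_image_file(path)
    return face_recognition.face_encodings(image)[0]

barack1 = encode("./sample-data/images/barack/new-images/barack1.jpeg")
barack2 = encode("./sample-data/images/barack/new-images/barack2.jpeg")
donald1 = encode("sample-data/images/donald/new-images/donald1.jpeg")

print("same person:     ", np.linalg.norm(barack1 - barack2))   # typically well below 0.6
print("different people:", np.linalg.norm(barack1 - donald1))   # typically above 0.6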

To identify a person's name from their encoding, the final step is straightforward: match this image encoding array with the closest measurements in our known database of encodings. This can be achieved using basic machine learning techniques, without the need for advanced deep learning. We'll utilize a simple linear SVM classifier, but other methods could also be effective. Essentially, we train the classifier to determine which known individual's measurements are most similar to a new test image. Within milliseconds, the classifier provides us with the individual's name. To learn more about SVMs, refer to the Wikipedia article on Support Vector Machines.

There is one catch: the SVM will always predict some name/tag, namely whichever known person's vectors are nearest in the Euclidean space, even for a completely unknown face. This is solved by applying a maximum distance check in commons.isWithinToleranceToEncodings.
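
The commons.isWithinToleranceToEncodings helper is not shown in this post; a minimal sketch of such a tolerance check could look like the following (the 0.6 threshold is the default used by the face_recognition library, not necessarily the project's value).

import numpy as np

# Sketch only: accept the SVM prediction if the candidate encoding is close
# enough to at least one known encoding of the predicted person.
def isWithinToleranceToEncodings(encoding, knownFaceEncodings, tolerance=0.6):
    distances = np.linalg.norm(np.asarray(knownFaceEncodings) - np.asarray(encoding), axis=1)
    return bool(np.any(distances <= tolerance))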

Back to classifier training

As mentioned before, in order to run the above code we first need to train our SVM classifier, so that clfStorage.loadLatestClassifier has something to load. This training essentially fills the 128-dimensional space with vectors of known faces, and assigns a name to each face (vector). As input, an array of such vectors, each mapped to a name, must be provided. To compile this input for the classifier, we use the already familiar components:

  • the HOG algorithm to extract face thumbnails from the input images
  • face_recognition_model_v1 to extract the 128-number vector from each thumbnail
  • manual assignment of a name to this vector

def fitClassifier(fitClassifierRequest, dirStructure):

    encodings = []
    encodingLabels = []

    for personImages in fitClassifierRequest['persons']:
        personImages['encodings'] = []
        for imagePath in personImages['imagePaths']:
            try:
                encoding = encoder.encodeImage(imagePath)
                encodings.append(encoding)
                encodingLabels.append(personImages['personName'])
                personImages['encodings'].append(encoding.tolist())
            except helpers.MultiFaceError as err:
                logging.warn(f"MultiFaceError on {imagePath}")

    clf = svm.LinearSVC()
    clf.fit(encodings,encodingLabels)

    clfStorage.saveClassifierWithRequest(clf, fitClassifierRequest, dirStructure.classifierDir)

    return fitClassifierRequest, clf

The input data (fitClassifierRequest) is structured as follows:

{
  "persons": [
    {
      "personName": "Barack Obama",
      "imagePaths": [
        "./sample-data/images/barack/new-images/barack1.jpeg",
        "./sample-data/images/barack/new-images/barack2.jpeg"
      ]
    },
    {
      "personName": "Donald Trump",
      "imagePaths": [
        "sample-data/images/donald/new-images/donald1.jpeg",
        "sample-data/images/donald/new-images/donald2.jpeg"
      ]
    }
  ]
}

The output, saved to the given output directory, is the following JSON, together with a binary representation of the state of the clf produced by clf.fit.

{
  "persons": [
    {
      "personName": "Barack Obama",
      "encodings": [
        [
          -0.09314524382352829, 0.11986921727657318,...
        ],
        [
          -0.0871209129691124, 0.17638660967350006, ...
        ]
      ]
    },
    {
      "personName": "Donald Trump",
      "encodings": [
        [
          -0.10035578906536102, 0.190534308552742, ...
        ],
        [
          -0.08303076773881912, 0.08978970348834991, ...
        ]
      ]
    }
  ]
}
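
The clfStorage module itself is not shown in this post. Here is a hypothetical sketch of saveClassifierWithRequest, pickling the classifier binary and writing the enriched request as JSON; file names and layout are assumptions, not the project's actual implementation.

import json
import os
import pickle
import time

def saveClassifierWithRequest(clf, fitClassifierRequest, classifierDir):
    os.makedirs(classifierDir, exist_ok=True)
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    # binary state of the fitted LinearSVC
    with open(os.path.join(classifierDir, f"classifier-{timestamp}.pickle"), "wb") as f:
        pickle.dump(clf, f)
    # the request enriched with encodings, as shown in the JSON above
    with open(os.path.join(classifierDir, f"classifier-{timestamp}.json"), "w") as f:
        json.dump(fitClassifierRequest, f, indent=2)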

Now we have all the data and all the inputs to run the classification, and subsequently the face recognition. In the next step we will front this with a REST API (Flask server), and extend the logic to processing Ring video streams.

Creating the API

We will again use the Flask server to expose the python methods as REST endpoints. Refer to Ringface: Accessing the Ring Device on how we set up Flask.
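
For completeness, here is a minimal bootstrap sketch so that the endpoint snippets below can be run on their own; the actual setup (configuration, pre-loading the classifier and fit data, host and port) follows the earlier post and may differ.

from flask import Flask, Response, jsonify, request

app = Flask(__name__)

# ... endpoint definitions from below ...

if __name__ == '__main__':
    # host and port are assumptions, adjust to your environment
    app.run(host="0.0.0.0", port=5000)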

The server will expose the following endpoints

@app.route('/recognition/local-video', methods=["POST"])
def recognitionLocalVideo():
    event = request.json
    videoFilePath = event['videoFileName']
    videoRecognitionResult = singleVideo.recognition(videoFilePath, dirStructure, clf, fitClassifierData, event)
    return Response(videoRecognitionResult, mimetype='application/json')

@app.route('/classifier/fit', methods=["POST"])
def classifier():
    fitClassifierRequest = request.json
...
    fitClassifierData, clf = fitClassifier(fitClassifierRequest, dirStructure)
    res = jsonify(fitClassifierData)
    parseEncodingsAsNumpyArrays(fitClassifierData)
    return res
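
To try the endpoints, requests such as the following can be sent. The host and port are assumptions to be adjusted to your Flask configuration; the payload keys match the handler code above and the sample data used earlier.

import requests

BASE_URL = "http://localhost:5000"  # assumption, adjust to your setup

fitRequest = {
    "persons": [
        {
            "personName": "Barack Obama",
            "imagePaths": ["./sample-data/images/barack/new-images/barack1.jpeg"]
        }
    ]
}
print(requests.post(f"{BASE_URL}/classifier/fit", json=fitRequest).json())

videoRequest = {"videoFileName": "./sample-data/downloaded-video.mp4"}
print(requests.post(f"{BASE_URL}/recognition/local-video", json=videoRequest).json())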

As we can see, the two main methods of this module are fitClassifier, which we discussed above, and singleVideo.recognition, which extends the single-image recognition to video.

The video recognition follows these steps (a naive sketch follows the list):

  • load the latest classifier
  • load the video file to process
  • iterate over frames of video
  • run the single image recognition on the frame
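
Here is a naive sketch of this loop, for illustration only; the optimised implementation with frame skipping and parallel workers follows further below.

import cv2
import face_recognition

def naiveVideoRecognition(videoFile, clf):
    capture = cv2.VideoCapture(videoFile)
    names = []
    while True:
        frame_got, frame = capture.read()
        if not frame_got:
            break
        # OpenCV delivers BGR frames, face_recognition expects RGB
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        face_locations = face_recognition.face_locations(rgb)
        for encoding in face_recognition.face_encodings(rgb, face_locations):
            names.append(clf.predict([encoding])[0])
    capture.release()
    return names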

While this naive extension to video processing already yields results, there are significant optimisations which we can consider:

  1. Ring captures videos at a relatively high frame rate, and since there is not much motion after the person has rung the bell, it is a waste of compute resources to process every frame. We expose a config value to define how many frames to skip, and recommend processing every 4th frame (see the example configuration after this list).
  2. Most computers offer 4-8 CPU cores, so we do not need to process the frames sequentially. Taking into account the limitations of parallel programming in Python (the GIL), we start 4 (a configurable number of) worker processes to process 4 frames in parallel.
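
Both optimisations are driven by configuration values read through the config() calls in the code below (python-decouple style). An example configuration follows; EACH_FRAME and PARALLELISM match the recommendations above, the remaining values are illustrative assumptions.

# example .env
PARALLELISM=4
EACH_FRAME=4
MIN_FRAMES=10
MAX_FRAMES=200
STOP_AFTER_EMPTY_FRAMES=5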

With this in mind, let's look at the source code, again available on GitHub.

def recognition(videoFile, dirStructure = DEFAULT_DIR_STUCTURE, clf = None, fitClassifierData = None, ringEvent= None):

    personCounter = 1

    logging.info(f"processing input video {videoFile}")
    result = VideoRecognitionData(videoFile)

    if clf is None:
        clf, fitClassifierData = clfStorage.loadLatestClassifier(dirStructure.classifierDir)

    input_movie = cv2.VideoCapture(videoFile)
    length = int(input_movie.get(cv2.CAP_PROP_FRAME_COUNT))
    logging.info(f"Total frames: {length}")

    frame_counter = 0
    noFaceFrameCounter = 0

    extractionResults = []

    faceFound = False

    logging.debug(f"frame 0 - {config('MAX_FRAMES')}: scheduling for extraction ")

    # process the math heavy part in parallel
    with mp.Pool(processes= config('PARALLELISM', cast=int)) as pool:
        while True:
            
            frame_got, frame = input_movie.read()

            frame_counter += 1
            
            if not frame_got:
                break
            
            if frame_counter > config('MAX_FRAMES', cast=int):
                logging.warn(f"will not consider more than first {config('MAX_FRAMES', cast=int)} frames")
                break

            if frame_counter > config('MIN_FRAMES', cast=int) and frame_counter % config('EACH_FRAME', cast=int)  != 0:
                # logging.debug(f"frame {frame_counter}: skipping ")
                continue

            # Convert the image from BGR color (which OpenCV uses) to RGB color (which face_recognition uses)
            start_time = time.time()
            image = frame[:, :, ::-1]
            extractionResults.append(pool.apply_async(extractFromImageParallel, (image, frame_counter)))

        # pool.close()

        # process the results sequentially
        for res in extractionResults:
            frame_counter, face_locations, encodings, image = res.get()
            # logging.debug(f"frame {frame_counter}: postprocessing")

            facesCount = len(face_locations)
            logging.debug(f"frame {frame_counter}: Number of faces detected: {facesCount}")

            # stop after couple of empty frames
            if facesCount == 0:
                noFaceFrameCounter += 1
                if faceFound and frame_counter > config('MIN_FRAMES', cast=int) and noFaceFrameCounter >= config('STOP_AFTER_EMPTY_FRAMES', cast=int) :
                    logging.warn(f"frame {frame_counter}: noFaceFrameCounter reached: {noFaceFrameCounter}. Stopping")
                    # at this point we do not need more 
                    pool.terminate()
                    break
                else:
                    continue

            noFaceFrameCounter = 0


            for i in range(facesCount):
                encoding = encodings[i]
                
                #process the recognised face
                if clf is not None:
                    start_time = time.time()
                    name = clf.predict([encoding])
                    logging.debug(f"frame {frame_counter}: predicted name: {name}")

                    # if commons.isWithinTolerance(encoding, encodingsDir):
                    knownFaceEncodings = findKnownFaceEncodings(name, fitClassifierData, frame_counter)
                    if commons.isWithinToleranceToEncodings(encoding, knownFaceEncodings):    
                        logging.info(f"frame {frame_counter}: Recognised: {name}")
                        faceFound = True
                        result.addRecognisedPerson(name[0])
                        continue
                    else :
                        logging.debug(f"frame {frame_counter}: Face outside of tolerance for {name}")

                # unknown face processing
                # do not process too small faces
                if faceTooSmall(face_locations[i], frame_counter):
                    continue

                top, right, bottom, left = face_locations[i]
                logging.debug(f"frame {frame_counter}: "+"The unknown face is located at pixel location Top: {}, Left: {}, Bottom: {}, Right: {}".format(top, left, bottom, right))
                thumbnail = image[top:bottom, left:right]
                pilThumbnail = PIL.Image.fromarray(thumbnail)
                faceFound = True

                # if logging.getLogger().level == logging.DEBUG:
                #     pilThumbnail.show()

                similarPerson = result.findSimilarPerson(encoding)
                if similarPerson is not None:
                    logging.debug(f"frame {frame_counter}: {similarPerson} in the frame {frame_counter}")
                    result.addToPerson(similarPerson, pilThumbnail, encoding)
                else: 
                    newPersonName = f"unknown-{personCounter}"
                    personCounter += 1
                    logging.info(f"frame {frame_counter}: New {newPersonName} in the frame {frame_counter}")
                    result.addToPerson(newPersonName, pilThumbnail, encoding)

    logging.debug("wait for the workers to terminate")
    pool.join()

    # saveResultAsRun(result, dirStructure.recogniserDir)
    if ringEvent is not None:
        saveResultAsProcessedEvent(result, dirStructure, ringEvent)

    return result.json()
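
The worker function extractFromImageParallel is not shown above. A minimal sketch of what it could look like, reusing the same face_recognition calls as in the single-image case; it must return exactly what the sequential post-processing loop unpacks.

import face_recognition

def extractFromImageParallel(image, frame_counter):
    # runs in a worker process of the pool
    face_locations = face_recognition.face_locations(image)
    encodings = face_recognition.face_encodings(image, face_locations)
    return frame_counter, face_locations, encodings, image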

With these building blocks in place, we can now manually process the first Ring video. I have bundled a command line starter with the project, which you can use to run the video processing.

python3 startRecogniserSingleVideo.py ./sample-data/downloaded-video.mp4

Better yet, continue with the next post on how to create an Angular frontend, a JS orchestrator backend, and a Mongo database that bundle the newly created APIs into a runnable solution.