Deep learning-based video analysis
by Mohd Irfan | 4th April 2023
5 min Read
We are aware that Netflix uses AI to generate personalised thumbnails for its TV and movie content, but we believe this article is the first example of using AI to create cover images on a creator platform. We have also open-sourced our implementation on GitHub.
Creating video thumbnails can be time consuming for creators, particularly if they are shooting content live and then immediately moving it to an on-demand catalog. Peloton is an example of this, with over 10,000 classes that were created live, all with similar-looking (but different) cover images showing the instructor smiling at the camera. Peloton has the resources to do this manually, but for many local creators recording Zoom classes, finding and uploading cover images has to be done after the event, and it's onerous. Our eureka moment was asking, "surely AI could do a good enough job of this?" This blog is about our experiences using AI to select thumbnail cover images from video recordings.
To fill this gap, we needed an AI service able to detect 4-5 frames from a stored video (whether recorded from a live streaming feed or uploaded manually) in which the person is looking at the camera, their eyes are open, and they are smiling.
To break it down : 
  1. Once staff have uploaded a video, or Zoom has finished recording, the AI/ML service should begin searching for the facial attributes mentioned above.
  2. After analysis, it should return its response along with the grabbed frames.
  3. Returned frames need to be cropped into a variety of sizes and shapes (circular, and rectangular in 16:9, 4:3, and 1:1 ratios).
After doing some investigation, we learned that there isn't a single service that can perform all of these tasks. Both the Amazon Rekognition service from AWS and the Video Intelligence API from Google are capable of detecting faces and other attributes. Amazon's Rekognition service returns the detected metadata attributes in its response but does not return the grabbed frames, whereas Google's Video Intelligence API does return the grabbed frames, but they are so small that they were not helpful to us. See for yourself below -
We looked at both the Rekognition service and the Video Intelligence API from an implementation point of view; our observations follow. After that, we show the actual side-by-side results we got from the two AI services for the videos we tested, and which one was better for our needs.
AWS’s Amazon Rekognition
Amazon Rekognition is a service that analyzes images and videos, finds and compares faces, and returns the detected attributes in its response. It only works with videos stored on S3 or streamed via Kinesis; the video must be encoded with the H.264 codec, and the supported file formats are MPEG-4 and MOV. It checks whether the input contains a face. If so, Amazon Rekognition locates the face in the image/video, examines its facial landmarks, such as the position of the eyes, and detects any emotions (such as appearing happy or sad). It then returns a percent confidence score for the face and the facial attributes detected in the image. Under the hood, it looks for the requested metadata in each frame and returns the result in the response, so the more frames the video has, the longer the analysis takes.
Response attributes :
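The attribute table from the original post isn't reproduced here; roughly, the fields we relied on from the GetFaceDetection response look like this (the numeric values are illustrative, and the response is trimmed to the attributes we used):

```json
{
  "JobStatus": "SUCCEEDED",
  "Faces": [
    {
      "Timestamp": 12345,
      "Face": {
        "BoundingBox": { "Width": 0.21, "Height": 0.38, "Left": 0.39, "Top": 0.24 },
        "Confidence": 99.9,
        "EyesOpen": { "Value": true, "Confidence": 98.7 },
        "Smile": { "Value": true, "Confidence": 95.2 },
        "Emotions": [{ "Type": "HAPPY", "Confidence": 97.1 }]
      }
    }
  ]
}
```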
There are a few other attributes it returns in response, but the ones mentioned above were useful to us, so we have excluded the remaining ones. For a complete response, see the Rekognition documentation.
The implementation steps are - 
1.  The first step is to instantiate RekognitionClient as shown below - 
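Our published code isn't reproduced verbatim here; a minimal sketch using the AWS SDK for JavaScript v3 (the region is a placeholder, and credentials are assumed to come from the environment or an IAM role):

```typescript
import { RekognitionClient } from "@aws-sdk/client-rekognition";

// Region is a placeholder; credentials are resolved from the environment / IAM role.
const rekognition = new RekognitionClient({ region: process.env.AWS_REGION ?? "us-east-1" });
```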
2.  The next step is to start face detection. The request requires the S3 bucket and the video to analyze. An SNS topic can also be passed in the request to chain the work asynchronously and listen for the SNS notification, processing the analysis result in a separate AWS Lambda function or by other means -
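A sketch of the request; the bucket name, object key, and the SNS topic/role ARNs are all placeholders:

```typescript
import { StartFaceDetectionCommand } from "@aws-sdk/client-rekognition";

const startResponse = await rekognition.send(
  new StartFaceDetectionCommand({
    Video: { S3Object: { Bucket: "my-video-bucket", Name: "recordings/class-001.mp4" } },
    FaceAttributes: "ALL", // request smile, eyes-open, emotions, etc., not just the default set
    NotificationChannel: {
      SNSTopicArn: "arn:aws:sns:us-east-1:123456789012:face-detection-complete",
      RoleArn: "arn:aws:iam::123456789012:role/RekognitionSnsPublishRole",
    },
  })
);

const jobId = startResponse.JobId; // needed later to fetch the analysis result
```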
3.  Once face detection is complete, a second request is made to the Rekognition service to read the returned response. Because Rekognition is essentially a prediction machine and its scores are prediction confidence values, we store the timestamps and the confidence scores of the detected face, eyes-open, and smiling attributes in an array -
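A sketch of reading the paginated GetFaceDetection result (this assumes the SNS notification already told us the job succeeded; the eyes-open/smile filter is one reasonable way to narrow the candidates, not necessarily the exact logic in our repo):

```typescript
import { GetFaceDetectionCommand } from "@aws-sdk/client-rekognition";

interface Candidate {
  timestamp: number;          // milliseconds from the start of the video
  faceConfidence: number;
  eyesOpenConfidence: number;
  smileConfidence: number;
}

const candidates: Candidate[] = [];
let nextToken: string | undefined;

do {
  const result = await rekognition.send(
    new GetFaceDetectionCommand({ JobId: jobId, NextToken: nextToken })
  );
  for (const detection of result.Faces ?? []) {
    const face = detection.Face;
    // Keep only frames where the person's eyes are open and they are smiling.
    if (face?.EyesOpen?.Value && face?.Smile?.Value) {
      candidates.push({
        timestamp: detection.Timestamp ?? 0,
        faceConfidence: face.Confidence ?? 0,
        eyesOpenConfidence: face.EyesOpen.Confidence ?? 0,
        smileConfidence: face.Smile.Confidence ?? 0,
      });
    }
  }
  nextToken = result.NextToken;
} while (nextToken);
```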
4.  The array built in the previous step is then sorted in descending order based on the confidence scores of the face, eyes-open, and smiling attributes to obtain the top three frames, as sketched below -
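For the ranking itself, a combined-score sort is enough; the equal weighting used here is an assumption for illustration, not the exact formula from our implementation:

```typescript
// Sort descending by the sum of the three confidence scores and keep the top three.
const topThree = [...candidates]
  .sort(
    (a, b) =>
      b.faceConfidence + b.eyesOpenConfidence + b.smileConfidence -
      (a.faceConfidence + a.eyesOpenConfidence + a.smileConfidence)
  )
  .slice(0, 3);
```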
5.  Then, we instructed ffmpeg to grab frames from the video at the timestamps of the first three items from the sorted array -
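A sketch of the frame grab, assuming the source video is available locally (or at a URL ffmpeg can read); the file paths are placeholders:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Grab a single full-resolution frame at the given timestamp (milliseconds).
async function grabFrame(videoPath: string, timestampMs: number, outPath: string): Promise<void> {
  await run("ffmpeg", [
    "-ss", (timestampMs / 1000).toFixed(3), // seek in seconds with millisecond precision
    "-i", videoPath,
    "-frames:v", "1",                        // output exactly one frame
    "-q:v", "2",                             // high JPEG quality
    "-y", outPath,
  ]);
}

for (const [index, candidate] of topThree.entries()) {
  await grabFrame("recordings/class-001.mp4", candidate.timestamp, `thumb-${index}.jpg`);
}
```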
6.  Rekognition returns bounding box coordinates for each detected face, which is essentially a box around the detected face. The idea is that one might want to label faces in an image/video, similar to what we see in many surveillance systems. A BoundingBox has the following properties :
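Per the Rekognition documentation, the box is expressed as ratios of the overall frame dimensions rather than pixels:

```typescript
// All four values are ratios (0-1) of the overall frame width/height,
// so they must be multiplied by the frame's pixel dimensions before cropping.
interface BoundingBox {
  Left: number;   // x of the top-left corner, as a ratio of frame width
  Top: number;    // y of the top-left corner, as a ratio of frame height
  Width: number;  // box width, as a ratio of frame width
  Height: number; // box height, as a ratio of frame height
}
```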
We add 100px padding around the box coordinates before taking full-height stills and cropping the edges off -
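A sketch of the padding-and-crop step for a chosen frame, given the BoundingBox of the face detected at that timestamp. The sharp library is our illustrative choice here; any image-cropping tool works:

```typescript
import sharp from "sharp";

const PADDING_PX = 100;

// Crop a full-height still around the detected face, padding the box by 100px on each side.
async function cropAroundFace(framePath: string, box: BoundingBox, outPath: string): Promise<void> {
  const image = sharp(framePath);
  const { width = 0, height = 0 } = await image.metadata();

  // Convert the ratio-based bounding box into pixel coordinates, clamped to the frame.
  const left = Math.max(0, Math.round(box.Left * width) - PADDING_PX);
  const right = Math.min(width, Math.round((box.Left + box.Width) * width) + PADDING_PX);

  // Keep the full height and crop the edges off horizontally around the padded box.
  await image.extract({ left, top: 0, width: right - left, height }).toFile(outPath);
}
```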
Google’s Video Intelligence API 
Conceptually, and in terms of features and operation, it is very similar to the Rekognition service, and it works with both stored and local images/videos. Standard live-streaming protocols such as RTSP, RTMP, and HLS are also supported.
Response attributes :
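The attribute table from the original post isn't reproduced here; roughly, the parts of the faceDetectionAnnotations we relied on look like this (values illustrative, and the exact attribute set depends on the face detection configuration):

```json
{
  "annotationResults": [
    {
      "faceDetectionAnnotations": [
        {
          "tracks": [
            {
              "segment": { "startTimeOffset": "12.345s", "endTimeOffset": "15.600s" },
              "timestampedObjects": [
                {
                  "timeOffset": "12.345s",
                  "normalizedBoundingBox": { "left": 0.39, "top": 0.24, "right": 0.60, "bottom": 0.62 },
                  "attributes": [
                    { "name": "looking_at_camera", "confidence": 0.93 },
                    { "name": "eyes_visible", "confidence": 0.97 },
                    { "name": "smiling", "confidence": 0.88 }
                  ]
                }
              ]
            }
          ],
          "thumbnail": "<small JPEG bytes - the tiny grabbed frame mentioned earlier>"
        }
      ]
    }
  ]
}
```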
There are a few other attributes it returns in response, which aren’t mentioned here.
The implementation steps are -
1.  The first step is to set up the VideoIntelligenceServiceClient and then start face detection by passing the URI of the stored video, or the local file encoded as a base64 string, in the request -
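A minimal sketch with the Node.js client library; the GCS URI is a placeholder, and credentials are assumed to come from the environment:

```typescript
import { protos, v1 } from "@google-cloud/video-intelligence";

const client = new v1.VideoIntelligenceServiceClient();
const Feature = protos.google.cloud.videointelligence.v1.Feature;

// inputUri assumes the video lives in a GCS bucket; for a local file,
// pass its base64-encoded bytes via inputContent instead.
const [operation] = await client.annotateVideo({
  inputUri: "gs://my-video-bucket/recordings/class-001.mp4",
  features: [Feature.FACE_DETECTION],
  videoContext: {
    faceDetectionConfig: { includeBoundingBoxes: true, includeAttributes: true },
  },
});

// annotateVideo is a long-running operation; wait for it to finish.
const [response] = await operation.promise();
```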
2.  After face detection is complete, the next step is to read the returned response. Video Intelligence divides the period for which faces are visible into small segments. Then, similar to Rekognition, the Video Intelligence API returns each segment's start/end time as well as the detected facial attributes, along with a prediction score (ranging between 0 and 1) for each attribute. For frame grabbing, we use the midpoint of the segment's start/end time and store it in an array along with the prediction scores of the eyes-open, looking-at-camera, and smiling attributes -
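A sketch of walking the annotation tracks; taking the best confidence per attribute within a track is one reasonable way to summarise it, not necessarily the exact logic in our repo:

```typescript
// Convert a protobuf Duration ({ seconds, nanos }) into milliseconds.
// (Loosely typed here for brevity; the generated types use Long for `seconds`.)
const toMs = (d?: { seconds?: any; nanos?: number | null } | null): number =>
  Number(d?.seconds ?? 0) * 1000 + (d?.nanos ?? 0) / 1e6;

interface ViCandidate {
  timestampMs: number;      // midpoint of the track's start/end time, used for frame grabbing
  lookingAtCamera: number;
  eyesVisible: number;
  smiling: number;
}

const viCandidates: ViCandidate[] = [];

for (const annotation of response.annotationResults?.[0]?.faceDetectionAnnotations ?? []) {
  for (const track of annotation.tracks ?? []) {
    const startMs = toMs(track.segment?.startTimeOffset);
    const endMs = toMs(track.segment?.endTimeOffset);

    // Keep the best confidence observed for each attribute within the track.
    const best: Record<string, number> = {};
    for (const obj of track.timestampedObjects ?? []) {
      for (const attr of obj.attributes ?? []) {
        if (attr.name) best[attr.name] = Math.max(best[attr.name] ?? 0, attr.confidence ?? 0);
      }
    }

    viCandidates.push({
      timestampMs: (startMs + endMs) / 2,
      lookingAtCamera: best["looking_at_camera"] ?? 0,
      eyesVisible: best["eyes_visible"] ?? 0,
      smiling: best["smiling"] ?? 0,
    });
  }
}
```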
3.  The array built in the previous step is then sorted in descending order based on the confidence scores of the looking-at-camera, eyes-open, and smiling attributes to obtain the top three frames.
4.  Then, we instructed ffmpeg to grab frames from the video at the timestamps of the first three items from the sorted array, reusing the same frame-grab approach shown in the Rekognition section.
5.  Video Intelligence, like Rekognition, returns bounding box coordinates for each detected face, which it calls "normalizedBoundingBox". It was incompatible with the crop function we wrote, so we ended up using Rekognition again, this time just to get the bounding box from the frame, and then applied the same cropping mechanism we used for Rekognition. (Note - if you're going with Video Intelligence, avoid using Rekognition and write your crop function so that it works with normalizedBoundingBox; a sketch of the conversion follows.)
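If you do stay Video Intelligence-only, the conversion is small; a sketch of turning normalizedBoundingBox (left/top/right/bottom as 0-1 ratios) into the Rekognition-style shape our crop function expects:

```typescript
// Video Intelligence expresses the box as 0-1 ratios of the frame: { left, top, right, bottom }.
// Our crop helper expects Rekognition-style { Left, Top, Width, Height } ratios, so convert:
function fromNormalizedBoundingBox(box: {
  left?: number | null;
  top?: number | null;
  right?: number | null;
  bottom?: number | null;
}): BoundingBox {
  const left = box.left ?? 0;
  const top = box.top ?? 0;
  return {
    Left: left,
    Top: top,
    Width: (box.right ?? 0) - left,
    Height: (box.bottom ?? 0) - top,
  };
}
```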
Observations
Conclusion
Our observations show that AWS's Rekognition service performs better than Google's Video Intelligence API for the use cases stated at the beginning - it's fairly obvious which one is better when you look at the thumbnails generated from the same videos below.
Results
Video1 ( recorded using Zoom )
Rekognition :
Video intelligence API :
Video2 ( recorded using Zoom )
Rekognition :
Video intelligence API :
Video3 ( recorded using Zoom )
Rekognition :
Video intelligence API :
Video4 ( recorded using IVS )
Rekognition :
Video intelligence API :
Video5 ( recorded using IVS )
Rekognition :
Video intelligence API :
Video6 ( recorded using IVS )
Rekognition :
Video intelligence API :
Note - To obtain the frames from the video, one must use ffmpeg or another tool. However, ffmpeg expects the frame's time in seconds or as an "HH:MM:SS.mmm"-style timestamp, and converting milliseconds into this format can occasionally introduce a few milliseconds of inaccuracy. This is a good reason why Rekognition and the Video Intelligence API should themselves return the detected frames at the original resolution.
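If you do need the HH:MM:SS form rather than plain seconds, a small helper like this (written here for illustration) performs the conversion:

```typescript
// Convert milliseconds into ffmpeg's "HH:MM:SS.mmm" timestamp format.
function msToFfmpegTimestamp(ms: number): string {
  const pad = (n: number, width = 2) => String(n).padStart(width, "0");
  const hours = Math.floor(ms / 3_600_000);
  const minutes = Math.floor((ms % 3_600_000) / 60_000);
  const seconds = Math.floor((ms % 60_000) / 1000);
  const millis = Math.floor(ms % 1000);
  return `${pad(hours)}:${pad(minutes)}:${pad(seconds)}.${pad(millis, 3)}`;
}

// Example: msToFfmpegTimestamp(83417) === "00:01:23.417"
```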