AI for Safety
The term ‘Artificial Intelligence (AI)’ refers to the capability of a machine to exhibit human-like behavior for a defined task. Advances in AI have been driven by machine learning, which enables a computer system to make decisions and predict outcomes by learning from data. The development of ‘deep neural networks’, which make ‘deep learning’ possible, allows systems to recognize complex data inputs such as human speech, images, and text. These systems can then perform the specific tasks for which they have been trained.
Online content appears in many different formats, some of which are difficult to analyze and moderate, such as video (which requires image analysis over multiple frames combined with audio analysis) and memes (which require a combination of text and image analysis plus contextual and cultural understanding). Content can also be posted as a live video stream or live text chat, which needs real-time analysis. AI-based content moderation systems can reduce the need for human moderation and reduce the impact on moderators of viewing harmful content. Harmful content is generated by only a small minority of users, so AI techniques can be used to identify malicious users and prioritize their content for review. ‘Metadata’ can encode context relevant to decision making, such as a user’s history, the number of followers they have, and information about the user’s real identity, such as age or location. The metadata available varies with the platform and the format of the content posted, which makes it difficult to build platform-agnostic moderation tools.
PART 1: HOW IS IMAGE CONTENT MODERATED?
Image moderation covers identifying the risks associated with an image or video uploaded by a community user to a public platform.
HASH MATCHING: This is a relatively simple way to detect and remove harmful content. Hash matching works by assigning a unique digital “fingerprint” (a hash) to previously detected harmful images and videos. Newly uploaded content can be flagged and removed from the platform during pre-moderation if its computed hash matches a hash stored in a database of known harmful content. The GIFCT (Global Internet Forum to Counter Terrorism) was founded by Facebook, Microsoft, Twitter, and YouTube to collectively reduce and eliminate violent and extremist content from their platforms. Its members share hashes of such content through a common database.
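A minimal sketch of the hash lookup, under stated assumptions: the fingerprint here is a SHA-256 digest of the raw file bytes, and `known_harmful` is a hypothetical in-memory stand-in for a real hash database. Production systems often use perceptual hashes that tolerate resizing or re-encoding; the `is_near_match` helper illustrates that idea with a Hamming-distance threshold and is an assumption that goes beyond what the text above describes.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Exact-match fingerprint: SHA-256 digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical database of fingerprints of previously detected harmful content.
known_harmful = {fingerprint(b"previously flagged image bytes")}


def is_known_harmful(upload: bytes) -> bool:
    """Flag an upload only if its fingerprint equals a stored one."""
    return fingerprint(upload) in known_harmful


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def is_near_match(upload_hash: int, known_hashes, max_distance: int = 4) -> bool:
    """Perceptual-hash style check: match if few enough bits differ."""
    return any(hamming_distance(upload_hash, h) <= max_distance
               for h in known_hashes)


print(is_known_harmful(b"previously flagged image bytes"))
print(is_known_harmful(b"a harmless holiday photo"))
```

Note that a cryptographic hash changes completely if a single byte of the file changes, so exact matching only catches identical copies; that is the reason near-match schemes exist.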
OBJECT DETECTION AND CHARACTER RECOGNITION: Object detection identifies specific predefined object classes within an image. Object localization is the process of finding what the object is and where it is located within the image. The image below shows how object localization is done: here the object is a ‘CAT’ and the rectangular bounding box marks its location.
The R-CNN (region-based convolutional neural network) algorithm proposes a set of candidate boxes in the image and checks whether any of them contains an object. R-CNN uses selective search to extract these boxes (called regions) from the image. Four kinds of cues characterize an object: varying scales, colors, textures, and enclosure. Selective search identifies these patterns in the image and proposes regions based on them. It first generates a sub-segmentation of the input image to create many small regions, then repeatedly combines similar regions into larger ones (based on color similarity, texture similarity, size similarity, and shape compatibility). The resulting regions are the candidate object locations (Regions of Interest).
Once the Regions of Interest are identified, each one is resized to the input size the CNN expects and passed through the convolutional network (ConvNet). The ConvNet extracts features from each region, and the regions are then classified using an algorithm such as an SVM (support vector machine, a supervised machine learning model for two-group classification problems). Finally, a linear regression model (in statistics, a linear approach to modeling the relationship between a scalar response and one or more explanatory variables) is trained to generate tighter bounding boxes for each identified object in the image.
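The region-merging idea behind selective search can be sketched as a greedy loop. This is a simplified illustration, not the full algorithm: the `Region` class and the function names are hypothetical, similarity uses only color and size (texture and shape compatibility are omitted), any two regions may merge regardless of whether they are adjacent, and the starting regions are given by hand rather than produced by a sub-segmentation step.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Region:
    box: tuple    # (x_min, y_min, x_max, y_max)
    color: float  # mean intensity in [0, 1]
    size: int     # pixel count


def similarity(a: Region, b: Region, image_size: int) -> float:
    """Color similarity plus size similarity (small regions merge first)."""
    color_sim = 1.0 - abs(a.color - b.color)
    size_sim = 1.0 - (a.size + b.size) / image_size
    return color_sim + size_sim


def merge(a: Region, b: Region) -> Region:
    """Union of the two boxes, with a size-weighted mean color."""
    box = (min(a.box[0], b.box[0]), min(a.box[1], b.box[1]),
           max(a.box[2], b.box[2]), max(a.box[3], b.box[3]))
    size = a.size + b.size
    color = (a.color * a.size + b.color * b.size) / size
    return Region(box, color, size)


def propose_regions(initial, image_size: int):
    """Greedily merge the most similar pair; every box seen is a proposal."""
    regions = list(initial)
    proposals = [r.box for r in regions]
    while len(regions) > 1:
        a, b = max(combinations(regions, 2),
                   key=lambda pair: similarity(pair[0], pair[1], image_size))
        regions.remove(a)
        regions.remove(b)
        merged = merge(a, b)
        regions.append(merged)
        proposals.append(merged.box)
    return proposals


# Two dark neighbouring patches and one bright patch in a 10x10 image.
start = [Region((0, 0, 4, 4), 0.10, 16),
         Region((4, 0, 8, 4), 0.15, 16),
         Region((0, 5, 9, 9), 0.90, 36)]
proposals = propose_regions(start, image_size=100)
print(proposals)
```

In this toy input the two dark patches merge first because their colors are close and they are small; the merged box then joins the bright patch. Each box collected along the way would be resized and passed to the ConvNet as a candidate region.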
CHARACTER RECOGNITION: Character recognition, on the other hand, is a machine vision technique used to identify text within images. The steps involved in recognizing text in video are as follows.
Text Detection: This phase takes an image or video frame as input and decides whether it contains text. It also identifies the text regions in the image.
Text Localization: Text localization merges the text regions to form text objects and defines tight bounds around them.
Text Tracking: This phase is applied to video data only. For readability, text embedded in a video typically appears in more than thirty consecutive frames. The text tracking phase exploits this temporal occurrence of the same text object across consecutive frames. It can be used to correct the results of the text detection and localization stages. It also speeds up text extraction, because the binarization and recognition steps do not have to be applied to every detected object.
Text Binarization: This step segments the text from the background within each bounded text object. The output of text binarization is a binary image in which text pixels and background pixels take two different binary levels.
Character Recognition: The last stage of the text extraction process is character recognition. This module converts the binary text object into ASCII text. The text detection, localization, and tracking modules are closely related to each other and constitute the most challenging part of the extraction process.
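Two of the steps above lend themselves to a short sketch. The code below is illustrative only and rests on assumptions not stated in the text: binarization uses a single global threshold (the mean intensity) and assumes dark text on a light background, and tracking treats two boxes in consecutive frames as the same text object when their intersection-over-union (IoU) exceeds a hypothetical cutoff of 0.5. The function names are made up for this example.

```python
def binarize(gray, threshold=None):
    """Map a grayscale patch (rows of 0-255 ints) to 1 = text, 0 = background.

    Assumes dark text on a light background; the default threshold is the
    mean intensity of the patch.
    """
    pixels = [p for row in gray for p in row]
    if threshold is None:
        threshold = sum(pixels) / len(pixels)
    return [[1 if p < threshold else 0 for p in row] for row in gray]


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def same_text_object(prev_box, cur_box, min_iou=0.5):
    """Tracking heuristic: heavily overlapping boxes are the same text."""
    return iou(prev_box, cur_box) >= min_iou


patch = [[250, 240, 245],
         [30, 245, 20],
         [240, 25, 250]]
binary = binarize(patch)
print(binary)
```

When `same_text_object` returns true for a box in the next frame, a pipeline could reuse the earlier recognition result instead of binarizing and recognizing the text again, which is the speed-up the tracking step describes.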