1.1 Optical Character Recognition (OCR)

OCR is a widely used technology for recognizing text inside images. It examines the text in a document image and converts the characters into machine-readable code that can be used for data processing. Proactive DLP can now use this technology to detect and redact sensitive information.

Supported file types:

  • Portable Document Format: PDF

  • Microsoft Office: doc, docx, xls, xlsx, ppt, pptx

  • Standalone image: jpg, png, tiff, bmp
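As a rough illustration of the list above, a script that submits files could pre-filter them by extension before expecting OCR results. This is only a sketch: the names `OCR_EXTENSIONS` and `is_ocr_candidate` are hypothetical, and it assumes that the file extension is a reliable hint, whereas the product may identify file types by content rather than by name.

```python
from pathlib import Path

# Extensions taken from the supported file type list above.
# Assumption: matching on the extension alone is good enough for a pre-check.
OCR_EXTENSIONS = {
    "pdf",
    "doc", "docx", "xls", "xlsx", "ppt", "pptx",
    "jpg", "png", "tiff", "bmp",
}


def is_ocr_candidate(filename):
    """Return True if the file's last extension is in the supported list."""
    extension = Path(filename).suffix.lower().lstrip(".")
    return extension in OCR_EXTENSIONS
```

For example, `is_ocr_candidate("scan.PNG")` is true, while `is_ocr_candidate("notes.txt")` is false because plain text files are not in the list.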

Supported languages:

  • English

To enable OCR:

Policies > Workflow rules > "Workflow name" > Proactive DLP > Optical character recognition (OCR)

images/download/attachments/6225357/image-20200921-203758.png

OCR Quality:

  • Normal: detects information without pre-processing the images

  • Best: pre-processes images before detection to improve the detection rate, at the cost of performance

Example output:

images/download/attachments/6225357/image-20200322-032951.png

System requirements:

  • The CPU must support the AVX2 and SSE4.1 instruction sets
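One way to verify this requirement on a Linux host is to look for the `avx2` and `sse4_1` flags in `/proc/cpuinfo`. The sketch below is not part of the product; the function names are hypothetical, and it assumes an x86 Linux system where the kernel reports CPU features on a `flags` line. On other platforms, use the operating system's own CPU information tool instead.

```python
import os

# Flag names as the Linux kernel reports them for AVX2 and SSE4.1.
REQUIRED_FLAGS = {"avx2", "sse4_1"}


def parse_cpu_flags(cpuinfo_text):
    """Return the flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_ocr_flags(cpuinfo_text):
    """Return the required flags that are absent from the given cpuinfo text."""
    return REQUIRED_FLAGS - parse_cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    cpuinfo_path = "/proc/cpuinfo"
    if os.path.exists(cpuinfo_path):
        with open(cpuinfo_path) as handle:
            missing = missing_ocr_flags(handle.read())
        if missing:
            print("Missing CPU flags: " + ", ".join(sorted(missing)))
        else:
            print("CPU supports AVX2 and SSE4.1")
    else:
        print("/proc/cpuinfo not found; check CPU features another way")
```

An empty result from `missing_ocr_flags` means both instruction sets are reported by the kernel.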

Factors that can affect accuracy:

  • Low-contrast documents

  • Documents with small text

  • Documents with blurry images

  • Colored paper or background in documents

  • Handwritten text

  • Unusual or script-type fonts