Automation in BPO: OCR, Computer Vision and Word Embedding technologies in Accounting – Part 2

published on 14 January 2022 | reading time approx. 4 minutes

OCR (Optical Character Recognition) software has been around for many years. In fact, the first patent for an OCR solution was granted in the US in 1931 (Patent no. 1,838,389) and was later acquired by IBM. Things accelerated in the late 1970s, when LexisNexis bought the solution from Kurzweil as one of its first commercial customers. The use case there was to automate the upload of legal documents to an online database.

Two years later Kurzweil sold its OCR business to Xerox, which took it further down the development path. Today there are many commercial OCR solutions available, such as Abbyy FineReader or Adobe Acrobat Pro DC, as well as free open-source ones such as tesseract-ocr. Tesseract, currently used by e.g. UiPath, originated at Hewlett-Packard in the 1980s, was released as open source in 2005 and was developed under Google’s sponsorship from 2006. It is now available free of charge under an open-source license (Apache 2.0). This means you can use it and modify its source code, which is convenient for building your own component-based solutions, either via direct code integration or via a low-level API (Application Programming Interface).

At the core of an OCR solution you will find a database of millions of representations of every letter of the alphabet, together with algorithms that compare the pixel representations of letters spotted on scanned documents with those stored references. In other words, when an OCR system sees the letter “N”, it doesn’t really see an “N” but a vector of pixels representing that letter. It then compares this vector’s numeric representation against the many stored vectors of “N” and of every other letter in its database, and returns the most probable fit – with some luck it will propose that it is indeed the letter “N”.
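To make the comparison of pixel vectors concrete, here is a minimal sketch in Python. It is only an illustration under strong assumptions: the 5x5 bitmaps, the three-letter "database" and the pixel-mismatch count used as the distance are all invented for this example, and production OCR engines use far larger references and more sophisticated models.

```python
# Toy reference "database": one 5x5 bitmap per letter (invented for illustration).
TEMPLATES = {
    "N": ["#...#",
          "##..#",
          "#.#.#",
          "#..##",
          "#...#"],
    "H": ["#...#",
          "#...#",
          "#####",
          "#...#",
          "#...#"],
    "L": ["#....",
          "#....",
          "#....",
          "#....",
          "#####"],
}


def to_vector(bitmap):
    """Flatten a list of row strings into a flat list of 0/1 pixel values."""
    return [1 if ch == "#" else 0 for row in bitmap for ch in row]


def pixel_distance(a, b):
    """Count the pixels on which two equally sized vectors disagree."""
    return sum(x != y for x, y in zip(a, b))


def recognize(bitmap):
    """Return (best_letter, distance) for the closest reference template."""
    vec = to_vector(bitmap)
    scores = {
        letter: pixel_distance(vec, to_vector(template))
        for letter, template in TEMPLATES.items()
    }
    best = min(scores, key=scores.get)
    return best, scores[best]


# A "scanned" N with one noisy pixel: the top-right corner is missing.
scanned = ["#....",
           "##..#",
           "#.#.#",
           "#..##",
           "#...#"]

letter, distance = recognize(scanned)
print(letter, distance)  # N 1
```

Even with a damaged pixel the scanned glyph is still closest to the stored “N”, which is the "best probable fit" idea in miniature.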

But on its own this is of limited value to us; we usually want more out of an OCR solution. OCR products therefore usually embed additional processing features, such as detecting whole words, sentences or places in the document that match a particular meta characteristic (e.g. invoice number or gross amount). This is done by another algorithmic element called Computer Vision. Computer Vision picks out those elements of a picture or video that match patterns it has learned from many similar historical examples. Good examples are detecting human faces in a moving crowd of people or identifying road signs for self-driving cars. On a scanned document it can tell you where the whole “No. of invoice: 1u/8889/11” is placed. This is the knowledge you need to cut out the matching field and put its content into the system under the “Invoice no.” or “Document no.” field.
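The field-location step can be sketched as follows. Note the assumptions: the OCR output (text lines with bounding boxes), the supplier name, the gross amount and the field names are all hypothetical, and simple regular expressions stand in for the trained Computer Vision model that a real product would use to find these regions.

```python
import re
from dataclasses import dataclass


@dataclass
class TextLine:
    text: str
    box: tuple  # (left, top, width, height) in pixels


# Hypothetical OCR output for one scanned invoice page.
ocr_lines = [
    TextLine("ACME Office Supplies Ltd.", (40, 30, 420, 28)),
    TextLine("No. of invoice: 1u/8889/11", (40, 90, 380, 24)),
    TextLine("Gross amount: 1,230.00 EUR", (40, 640, 360, 24)),
]

# Regular expressions as a simplified stand-in for learned field detectors.
FIELD_PATTERNS = {
    "invoice_no": re.compile(
        r"(?:No\. of invoice|Invoice no\.?)\s*:\s*(\S+)", re.IGNORECASE
    ),
    "gross_amount": re.compile(r"Gross amount\s*:\s*([\d.,]+)", re.IGNORECASE),
}


def locate_fields(lines):
    """Return {field_name: {"value": ..., "box": ...}} for the first match of each field."""
    found = {}
    for line in lines:
        for name, pattern in FIELD_PATTERNS.items():
            match = pattern.search(line.text)
            if match and name not in found:
                found[name] = {"value": match.group(1), "box": line.box}
    return found


fields = locate_fields(ocr_lines)
print(fields["invoice_no"])  # value and position to map to the "Invoice no." field
```

The result carries both the extracted value and its position on the page, which is what a downstream system needs to fill the “Invoice no.” field and to show the user where the value came from.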

And then, if you would like to go beyond the OCR realm into the area of Artificial Intelligence, you can think about ontology extraction and semantic analysis. A word in a document is, by itself, not something your computer understands, and it certainly cannot associate it with other definitions and values. In other words, for the machine there is no meaningful information embedded in this word.

Let’s say the invoice is about pencils. Well, the word “pencil” means nothing to a machine. It is just a string of zeros and ones - 00111110001010101010000011010101…. - but we want to know whether “pencil” is similar to e.g. “office supply materials”. This is a job for another technique called Word Embedding.
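As a small illustration of the "zeros and ones" point, the snippet below prints the bits a computer might store for the word. It assumes UTF-8 encoding, which is only one of several possible encodings.

```python
word = "pencil"

# One 8-bit group per character, assuming UTF-8 encoding.
bits = " ".join(format(byte, "08b") for byte in word.encode("utf-8"))
print(bits)
# 01110000 01100101 01101110 01100011 01101001 01101100
```

Nothing in these bits says that a pencil is something you write with or buy for an office, which is exactly the gap that word embeddings try to close.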

Word embedding converts your word into a vector representation and places this vector in an n-dimensional space. An example of part of the vector for the word “cat” can be seen above. It then measures the distances between this vector and the vectors of other words in the respective language. But wouldn’t you need every word in the chosen language already processed into word vectors? Yes, that’s true! You can use implementations of algorithms like GloVe, BERT or FastText, which come with such representations for whole languages ready for you to use. Your computer can then check whether the word “pencil” lies closer to or further from other words: is it close to the vector representations of words like “office”, “paper”, “ink”, “writing” and the like? This gives you a layer of understanding on which you can build your Artificial Intelligence development with a rich information context. Now you can actually teach a machine something (hence Machine Learning), because the machine is able to understand you.
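The distance check can be sketched with cosine similarity, a common way to compare word vectors. The four-dimensional vectors below are made up for illustration; pretrained models such as GloVe or FastText provide vectors with hundreds of dimensions, and the numbers here do not come from any of them.

```python
import math

# Made-up 4-dimensional vectors; real pretrained embeddings are much larger.
EMBEDDINGS = {
    "pencil": [0.9, 0.8, 0.1, 0.0],
    "paper":  [0.8, 0.9, 0.2, 0.1],
    "ink":    [0.7, 0.6, 0.3, 0.0],
    "office": [0.6, 0.7, 0.4, 0.2],
    "cat":    [0.0, 0.1, 0.9, 0.8],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, k=3):
    """Return the k words whose vectors are most similar to the given word's vector."""
    target = EMBEDDINGS[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in EMBEDDINGS.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


for neighbour, score in nearest("pencil"):
    print(f"{neighbour}: {score:.3f}")
```

With these toy vectors, “paper”, “ink” and “office” come out as the nearest neighbours of “pencil”, while “cat” scores far lower. An accounting system could use the same kind of ranking to suggest a cost category for an invoice line.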

But coming back to the main focus – accounting for purchase invoices. Why do we even need OCR, Computer Vision and Word Embedding technologies?

Well, let us think about two extreme variations of the accounts payable process. On the one hand, we have an accountant who receives the invoice on paper, checks it and manually enters the main information from the invoice into the accounting system. Then the booking account is assigned and the tax rates are confirmed. Done. The booking is in the system and from now on every reporting, tax compliance or auditing process will depend on this booking entry. The whole process for this one invoice can take from a few to a dozen minutes.

But there are a lot of things that could go wrong. They range from formal inconsistencies, such as a wrong Tax ID or bank account, mistakes in the supplier data or a wrong currency exchange date, to substantive questions like “Is this an operational cost or part of the research project?” or “What kind of asset is this?”. Often you also have costs that may or may not have been approved by a certain business unit, or an invoice that was issued twice. So for one invoice with clear and correct information the process is not complicated, but for a whole invoice booking ecosystem it gets more complicated.
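Some of the formal checks mentioned here are easy to express as rules. The sketch below is a simplified illustration, not a complete validation routine: the data fields, the supplier master data and the duplicate rule (same supplier tax ID plus same invoice number) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    supplier_tax_id: str
    invoice_no: str
    gross_amount: float
    approved_by_business_unit: bool


# Hypothetical supplier master data.
KNOWN_SUPPLIER_TAX_IDS = {"PL1234567890", "DE987654321"}


def check_invoice(invoice, already_booked):
    """Return a list of formal issues; an empty list means no issue was detected."""
    issues = []
    if invoice.supplier_tax_id not in KNOWN_SUPPLIER_TAX_IDS:
        issues.append("unknown supplier tax ID")
    if invoice.gross_amount <= 0:
        issues.append("non-positive gross amount")
    if not invoice.approved_by_business_unit:
        issues.append("cost not approved by business unit")
    # Assumed duplicate rule: same supplier and same invoice number.
    if (invoice.supplier_tax_id, invoice.invoice_no) in already_booked:
        issues.append("possible duplicate invoice")
    return issues


booked = {("PL1234567890", "1u/8889/11")}
incoming = Invoice("PL1234567890", "1u/8889/11", 1230.00, True)
print(check_invoice(incoming, booked))  # ['possible duplicate invoice']
```

Substantive questions, such as whether a cost belongs to operations or to a research project, cannot be settled by rules like these and still need an accountant’s judgement.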

If we have many more of these invoices, it gets even more time-consuming. With higher volumes we naturally tend to improve processes, and when we optimize them we usually look at the manual tasks that are repeatable and try to streamline them. In accounts payable, if passing information from document scans to the transactional system, checking formal correctness and describing invoices are manual tasks, then these are what we usually look into. This is where OCR systems and workflow systems come in.

If we then have not 1 but 10,000 invoices to process, it is also easier to have a document processing unit and an accounting unit, with traceable workflow systems to manage the flow of documents. But the efficiency gain can only reach a certain level: roughly 10-30 per cent, based on my professional experience. What next?