Created by Cornelius Koller

General Idea & Rationale

Most online 'creator' personas have some sort of 'core product' they are involved in, such as beauty, sports, personal growth, tech or finance. This core content is usually marketed and distributed in multiple forms on multiple platforms, resulting in multiple income streams from one primary value stream.

Software Product / API

IFRS Accounting statement parser

While publicly listed companies are obliged to publish their IFRS (group) accounting statements, they are not obliged to provide them in easy-to-use file formats such as spreadsheets. Instead, most companies publish PDF files that are usually 300-400 pages long, of which fewer than 10 pages contain the IFRS statements (balance sheet, profit and loss statement, statement of other comprehensive income, cash flow statement and statement of changes in equity). These PDFs are generally not optimized for automatic data extraction (there may be a rationale behind this). While it is possible to buy the data from providers like Bloomberg, this is usually not an option for retail investors, researchers or ad-hoc queries.

Generally speaking, there are three sub-problems:

  1. Extract the relevant pages from a 300-400 page document
  2. Extract the tables on the relevant pages
  3. Tag tables with the IFRS component they represent

The first sub-problem requires classifying the page contents. While it is entirely possible to do this with a machine learning model, I did not pursue this approach because of the amount of training data needed. Instead, I implemented a relatively simple algorithm that relies on a few assumptions about the document structure:

  1. The relevant pages have keywords on them, for example "Konzernbilanz" or "Aktiva"
  2. There is a chapter slide right before the IFRS accounting statements that has multiple keywords followed by numbers on it.
  3. Irrelevant pages have no keywords on them, come before the relevant chapter slide, and/or have multiple keywords on them.
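
The three assumptions above can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the keyword list, the function names, the rule "a chapter slide has at least two keywords each followed by a number on the same line", and the choice of the first such page are all assumptions made for this example. The page texts are assumed to have been extracted from the PDF already.

```python
import re

# Illustrative keyword list (an assumption; the original list is not given).
KEYWORDS = (
    "Konzernbilanz",
    "Konzern-Gewinn- und Verlustrechnung",
    "Konzern-Kapitalflussrechnung",
)


def keywords_on_page(text):
    """Return the keywords that occur in the page text."""
    return [kw for kw in KEYWORDS if kw in text]


def is_chapter_slide(text):
    """True if at least two keywords are each followed by a number on the same line."""
    hits = 0
    for kw in KEYWORDS:
        # keyword, then any non-digit characters on the same line, then digits
        if re.search(re.escape(kw) + r"[^\n\d]*\d+", text):
            hits += 1
    return hits >= 2


def select_relevant_pages(pages):
    """Return indices of pages after the chapter slide that carry exactly one keyword."""
    chapter_index = None
    for i, text in enumerate(pages):
        if is_chapter_slide(text):
            chapter_index = i  # simplification: take the first matching page
            break
    if chapter_index is None:
        return []
    return [
        i
        for i in range(chapter_index + 1, len(pages))
        if len(keywords_on_page(pages[i])) == 1
    ]
```

Calling select_relevant_pages with a list of page strings returns the zero-based indices of the candidate statement pages. Treating "exactly one keyword" as the relevance criterion is one possible reading of assumption 3 and would need tuning against real reports.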

While there are a number of tools for extracting tables from PDFs, they usually rely on specific characters to separate the individual columns. This does not work well for accounting reports, which frequently use a layout like this one (Deutsche Wohnen, annual report 2020):

Screenshot from 2021-12-26 18-46-59.png

We can see that the columns are separated only by whitespace. This is easy for humans to read, but not for computers.

As I am not aware of any tool that can separate columns by whitespace, I decided to implement the algorithm myself (it is not perfect, given the relatively short time in which I developed it). The algorithm exploits the fact that PDF is a graphical format: every piece of text is contained in an object with x and y coordinates. From the positions where text is present, we can derive the positions where there is none and generate column separators from those gaps. All text can then be sorted into the resulting columns.
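
The idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the original code: the TextBox structure, the min_gap and y_tolerance parameters and the function names are hypothetical, the text boxes are assumed to have been extracted from the PDF already, and y is assumed to grow downwards on the page.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import NamedTuple


class TextBox(NamedTuple):
    """Hypothetical container for one piece of text and its position."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # vertical position of the line (assumed to grow downwards)


def find_separators(boxes, min_gap=5.0):
    """Return x positions centred in horizontal gaps that no text box covers."""
    intervals = sorted((b.x0, b.x1) for b in boxes)
    separators = []
    if not intervals:
        return separators
    covered_until = intervals[0][1]
    for x0, x1 in intervals[1:]:
        # a gap exists if the next box starts well after everything seen so far
        if x0 - covered_until >= min_gap:
            separators.append((covered_until + x0) / 2)
        covered_until = max(covered_until, x1)
    return separators


def build_table(boxes, min_gap=5.0, y_tolerance=2.0):
    """Sort text boxes into rows (by y) and columns (by separator position)."""
    separators = find_separators(boxes, min_gap)
    n_cols = len(separators) + 1
    rows = defaultdict(lambda: [""] * n_cols)
    for b in sorted(boxes, key=lambda box: box.x0):
        # column index = number of separators left of the box centre
        col = bisect_right(separators, (b.x0 + b.x1) / 2)
        # crude row grouping: bucket y values by the tolerance
        row_key = round(b.y / y_tolerance)
        cell = rows[row_key][col]
        rows[row_key][col] = f"{cell} {b.text}".strip()
    return [rows[key] for key in sorted(rows)]
```

build_table returns a list of rows, each a list of cell strings, which could then be written to a spreadsheet. The sketch treats any vertical strip that is empty across the whole page region as a separator, so a heading that spans several columns would hide the gaps beneath it; a real implementation would have to handle such cases.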

The output is then an Excel file like this (derived from the Deutsche Wohnen SE annual report 2020):

dw_cropped_2020.xlsx

The functionality is available as an API, which can be made pay-per-use by selling API keys to potential users. An (unsecured) demo is running here: http://146.148.45.196/

Additional Content / Websites

Simply put, while the API products can be offered as-is, it is also feasible to integrate them into one's own website and place advertisements on that website.

An example of a simple UI is depicted in this screenshot: