Created by Cornelius Koller
Most online 'creator' personas have some sort of 'core product' they are involved in, such as beauty, sports, personal growth, tech or finance. Usually, this core content is then marketed and distributed in multiple forms on multiple platforms, resulting in multiple income streams from one primary value stream.
While publicly listed companies are obliged to publish their IFRS (group) accounting statements, they are not obliged to provide them in easy-to-use file formats like spreadsheets. Instead, most companies publish PDF files that are usually 300 to 400 pages long, of which fewer than 10 pages are relevant for the IFRS statements (balance sheet, statement of other comprehensive income, cash flow statement, statement of changes in equity, and profit and loss statement). Generally speaking, these PDFs are not optimized for automatic data extraction (there may be a rationale behind this). While it is possible to buy the data from companies like Bloomberg, this is usually not feasible for retail investors, researchers or ad-hoc queries.
Generally speaking, there are three sub-problems:
The first issue requires a classification of the page contents. While it is entirely possible to perform this operation with a machine learning model, I did not pursue this approach due to the amount of training data needed. Instead, I implemented a relatively simple algorithm that relies on a few assumptions about the document structure:
While there are a number of tools to extract tables from PDFs, they usually rely on specific characters to separate the individual columns. This does not work properly for accounting reports, which frequently use a layout like this one (Deutsche Wohnen, annual report 2020):
We can see that there is only whitespace in between columns. This is easily readable for humans, but not for computers.
As I am not aware of any tool that is able to separate columns by whitespace, I decided to implement the algorithm myself (it is, however, not perfect, given the relatively short timespan during which I developed it). Basically, the algorithm utilizes the fact that PDF is a graphical format: every piece of text is contained in an object with x and y coordinates. From the information about where text is, we can derive where nothing is, and generate column separators from those empty regions. Then we can sort all text into the columns generated by the algorithm.
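The idea can be illustrated with a minimal Python sketch. This is not the original implementation, only one way the described approach could look. It assumes the text boxes have already been extracted from a page as `(x0, x1, y, text)` tuples; the function names, the `min_gap` threshold and the sample values are all hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# Hypothetical input format: left edge, right edge, vertical position, text.
Box = Tuple[float, float, float, str]


def find_column_separators(boxes: List[Box], min_gap: float = 5.0) -> List[float]:
    """Return the x midpoints of horizontal gaps that no text box covers."""
    intervals = sorted((x0, x1) for x0, x1, _, _ in boxes)
    separators: List[float] = []
    if not intervals:
        return separators
    covered_until = intervals[0][1]
    for x0, x1 in intervals[1:]:
        # A gap exists if the next box starts well after everything seen so far.
        if x0 - covered_until >= min_gap:
            separators.append((covered_until + x0) / 2)
        covered_until = max(covered_until, x1)
    return separators


def assign_to_columns(boxes: List[Box], separators: List[float]) -> List[List[str]]:
    """Sort text into a row-by-column grid using the separators."""
    n_cols = len(separators) + 1
    rows: Dict[int, List[str]] = {}
    for x0, x1, y, text in boxes:
        # The column index is the number of separators left of the box center.
        col = bisect_right(separators, (x0 + x1) / 2)
        # Assumption: boxes on the same line share the same rounded y value.
        row = rows.setdefault(round(y), [""] * n_cols)
        row[col] = (row[col] + " " + text).strip()
    # Assumption: y is measured from the top of the page, so ascending = top-down.
    return [rows[key] for key in sorted(rows)]


# Made-up sample data, not taken from any real report.
boxes = [
    (50, 180, 100, "Item A"), (300, 350, 100, "1,000.0"), (400, 450, 100, "900.0"),
    (50, 150, 115, "Item B"), (310, 350, 115, "12.3"), (410, 450, 115, "10.1"),
]
separators = find_column_separators(boxes)
table = assign_to_columns(boxes, separators)
```

In this sketch, a separator is only created where a vertical strip is empty across all rows, which is why right-aligned numbers of different widths still end up in the same column. Real reports would likely need extra handling, for example a tolerance when grouping rows and special treatment of headings that span several columns.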
The output is then an Excel file like this (derived from the Deutsche Wohnen SE annual report 2020):
The functionality is available as an API, which can be made pay-per-use by selling API keys to potential users. An (unsecured) demo is running here: http://146.148.45.196/
Simply put, while the API products can be offered as-is, it is also feasible to integrate them into one's own website and then place advertisements on that website.
An example of a simple UI is depicted in this screenshot: