Quant Lab - NLP-enabled impact investing

Maxence Jeunesse PhD , Senior Data Scientist Quantlab – Core Investments

Guillaume Chevalier , Data Scientist Quantlab – Core Investments

Thomas Roulland, CFA, FRM, CIPM , Head of Solutions, Models and Tools Responsible Investment – Core Investments

 

Key points

  • AXA IM wants to assess how companies contribute positively to the United Nations Sustainable Development Goals (SDGs). To develop and refine such a process, we want to equip analysts with a tool that helps them rapidly analyse thousands of documents.

  • Our analysts have already studied hundreds of companies. By combining our internal knowledge pool with a set of Wikipedia paragraphs generated from a list of keywords, we built a training set that a supervised learning algorithm can use to score paragraphs of corporate documents. We plan to use this data to understand how these scores relate to the primary and secondary SDGs of each company.

  • The integration of natural language processing (NLP) outputs into the SDG framework will help cover a broader universe of companies in a time-effective manner - with these new datasets, we will be able to react more quickly in our investment decisions. With this tool, and other initiatives, we are looking to develop augmented intelligence – that is, a unique combination of technology, quantitative data, and financial and responsible investing expertise.

AXA IM: An active, responsible investor

AXA IM is an active, long-term, global, multi-asset investor. We work with our clients today to provide the solutions they need to help build a better tomorrow for their investments, while creating a positive change for the world in which we all live. Responsible Investment (RI) has a long history at AXA IM, which has had a dedicated team for 20 years. AXA IM uses a proprietary methodology to evaluate the environmental, social and governance (ESG) pillars of each company. Alongside this, analysts have studied businesses through the lens of their ESG policies, climate actions and engagement in relation to the United Nations’ Sustainable Development Goals (SDGs).

Within AXA IM’s dedicated RI team, the Solutions, Models and Tools team acts as the reference partner for our fund managers, in terms of ESG and SDG integration. Part of the team’s mission is to deliver innovative methodologies and tools to assess ESG risks. These tools, which help analysts and portfolio managers navigate company reports, include natural language processing (NLP) and automated indexing, based on text classification.

Natural language processing

According to Wikipedia, natural language processing (NLP) is a subfield of linguistics, computer science, information engineering and artificial intelligence concerned with the interactions between computers and human (natural) languages: in particular, how to programme computers to process and analyse large amounts of natural language data. The field is an active research topic for both academics and companies.

“Sentence-embedding models” produce mathematical representations of text: they map a word or a sentence to a point in a vector space, so that texts with similar meanings lie close together. Exhibit 1, below, shows how we can visually differentiate sentences related to two different SDGs using such a model.

These models have proved effective in text classification tasks. Some internet search engine and social media companies have open-sourced their internal models, which helped us design our own algorithms.

Exhibit 1: Embedded representations of sub-goals from SDG 3 (Health) and SDG 5 (Gender Equality).

Source: spaCy, UNSDG.org and AXA IM Quant Lab, as of 12 December 2019.
First and second axes of a PCA applied to the spaCy English Word2Vec model.
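As a rough illustration of the idea behind Exhibit 1, the sketch below averages word vectors into a sentence vector and projects the result onto the first two PCA axes. It is not the code used for the exhibit: the four-dimensional word vectors, the example sentences and the helper names `embed` and `pca_2d` are all hypothetical, and a real pretrained model would supply vectors with a few hundred dimensions.

```python
import numpy as np

# Hypothetical 4-dimensional word vectors (an assumption for illustration);
# a pretrained model would supply a few hundred dimensions per word.
WORD_VECTORS = {
    "health":   np.array([0.9, 0.1, 0.0, 0.2]),
    "disease":  np.array([0.8, 0.2, 0.1, 0.1]),
    "vaccine":  np.array([0.7, 0.0, 0.2, 0.3]),
    "women":    np.array([0.1, 0.9, 0.2, 0.0]),
    "equality": np.array([0.0, 0.8, 0.3, 0.1]),
    "girls":    np.array([0.2, 0.7, 0.1, 0.2]),
}


def embed(sentence):
    """Average the vectors of the known words in a sentence."""
    vectors = [WORD_VECTORS[w] for w in sentence.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return np.zeros(4)
    return np.mean(vectors, axis=0)


def pca_2d(points):
    """Project points onto their first two principal components."""
    centred = points - points.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T


sentences = [
    "vaccine against disease",   # health-related
    "health and disease",        # health-related
    "equality for women",        # gender-related
    "women and girls",           # gender-related
]
embeddings = np.vstack([embed(s) for s in sentences])
coords = pca_2d(embeddings)
for sentence, (x, y) in zip(sentences, coords):
    print(f"{sentence!r}: ({x:+.2f}, {y:+.2f})")
```

With these toy vectors, the two health sentences and the two gender sentences fall on opposite sides of the first axis, which is the kind of separation the exhibit illustrates.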

Supervised learning

Supervised learning is a machine-learning task in which the expected prediction is available for each sample when fitting the model. Simply put, we want to build an algorithm (i.e. a set of rules) that reproduces the results observed on the learning set as closely as possible. In our case, this means an algorithm that can recover the label of each paragraph from its text content.

For years, AXA IM has carried out many studies through the lens of RI, across sectors and countries. This historical database of expert analyses can be used as a validation set, to check whether the rules learned by the algorithm can be safely extrapolated to unknown companies.

Paragraph Scoring

Supervised learning methods are only as good as their training set. A good training set has many examples and is well balanced, with no selection bias. Verifying that these conditions hold is a delicate task.

Training set construction via Wikipedia extension. To increase the number of examples, we use a generative method for labelling paragraphs. First, an analyst lists keywords associated with each SDG. We then associate each keyword with Wikipedia pages, from which we extract paragraphs. A paragraph is defined as a collection of sentences.

Exhibit 2: List of keywords associated with SDG 3 (Health).

Each of these keywords can be related to a dedicated Wikipedia page. Using this methodology, we can extract around 20,000 correctly labelled paragraphs, from more than 250 keywords associated with 18 themes (17 SDGs plus one for finance terms) (Exhibit 3).
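The keyword-to-paragraph labelling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the production pipeline: the keyword lists, the page texts (held in memory so the example runs offline) and the helper names `split_paragraphs` and `build_training_set` are hypothetical, and splitting on blank lines is only one possible definition of a paragraph.

```python
# Hypothetical keyword lists and page texts, kept in memory so that the
# sketch runs offline; a real pipeline would fetch the Wikipedia pages.
KEYWORDS_BY_THEME = {
    "SDG 3": ["vaccine"],
    "SDG 5": ["gender equality"],
}
PAGE_TEXT = {
    "vaccine": (
        "A vaccine trains the immune system. It is given before exposure.\n\n"
        "Vaccination campaigns have reduced many diseases."
    ),
    "gender equality": "Gender equality means equal access to resources.",
}


def split_paragraphs(text):
    """Split a page into paragraphs on blank lines, dropping empty ones."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def build_training_set(keywords_by_theme, page_text):
    """Label every paragraph of a keyword's page with the keyword's theme."""
    examples = []
    for theme, keywords in keywords_by_theme.items():
        for keyword in keywords:
            for paragraph in split_paragraphs(page_text.get(keyword, "")):
                examples.append((paragraph, theme))
    return examples


training_set = build_training_set(KEYWORDS_BY_THEME, PAGE_TEXT)
for paragraph, theme in training_set:
    print(theme, "|", paragraph)
```

Every paragraph inherits the theme of the keyword that led to its page, which is why the labels come for free but also why a long page can dominate a theme.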

Exhibit 3: Paragraph distribution per theme

As shown in Exhibit 3, our training set is not perfectly balanced. This is because we listed slightly more keywords on the theme of gender equality, and the Wikipedia pages associated with those keywords are more developed. In other words, if we are not careful when training the model, the imbalanced training set could produce a model that overweights this theme.
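One common way to counter such an imbalance is to weight each class inversely to its frequency. The sketch below uses the heuristic weight = n_samples / (n_classes × class count), which is the same rule scikit-learn applies for `class_weight="balanced"`; the exact weighting scheme used in this study is not specified, so this choice and the toy label counts are assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy label distribution (hypothetical): one over-represented theme.
labels = ["SDG 5"] * 6 + ["SDG 3"] * 3 + ["SDG 13"] * 1
weights = balanced_class_weights(labels)
for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label}: {weight:.3f}")
```

With these weights, each class contributes the same total weight to the training loss, so the over-represented theme no longer dominates.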

The average paragraph length is 75 words (Exhibit 4).

Exhibit 4: Number of words per paragraph

Source: Wikipedia and AXA IM Quant Lab

Classification task. This is one of the most classic supervised learning problems. The goal is to find an estimator that predicts the class of each data point from the features associated with it. The problem here is a multi-class one. There are 19 classes - one for each of the 17 SDGs, one for financial vocabulary examples, and one made of randomly generated paragraphs. The last class is used to exclude meaningless paragraphs that result from bad parsing in the step that transforms companies’ reports into structured documents.

Models. We tested different models on the dataset of Wikipedia paragraphs generated from the list of keywords.

Model 1:

Features: Vectorisation of paragraphs using word embedding technique (FastText)

Algorithm: FastText algorithm

Model 2:

Features: Number of keywords divided by the total number of words in the paragraph

Algorithm: Logistic Regression (using a weighting methodology to tackle imbalance in the training set)
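The Model 2 feature can be sketched as a simple ratio computed once per theme. The tokenisation rule, the treatment of a multi-word keyword as a single hit and the function name `keyword_ratio` are assumptions made for this illustration; the resulting ratios would then be passed to a weighted logistic regression, which is not shown here.

```python
import re


def keyword_ratio(paragraph, keywords):
    """Number of keyword occurrences divided by the paragraph's word count."""
    words = re.findall(r"[a-z']+", paragraph.lower())
    if not words:
        return 0.0
    # Re-join the normalised words so multi-word keywords can be matched.
    text = " ".join(words)
    hits = sum(
        len(re.findall(r"\b" + re.escape(keyword.lower()) + r"\b", text))
        for keyword in keywords
    )
    return hits / len(words)


# Hypothetical keyword lists; one feature is computed per theme.
keywords_by_theme = {
    "SDG 3": ["vaccine", "public health"],
    "SDG 5": ["gender equality"],
}
paragraph = "Vaccine access improves public health and health outcomes."
features = {
    theme: keyword_ratio(paragraph, keywords)
    for theme, keywords in keywords_by_theme.items()
}
print(features)
```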

Model 1 x Model 2:

Features: Combines outputs of Models 1 and 2.

Algorithm: Average of the logarithm of the predicted probabilities for Model 1 and Model 2
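Averaging the logarithms of two probabilities is equivalent to taking the logarithm of their geometric mean, so a class must score reasonably well under both models to rank highly. A minimal sketch follows; the probability values are made up, the small floor `eps` guards against log(0), and the final renormalisation is an assumption that does not change the ranking of classes.

```python
import math


def combine(probs_1, probs_2, eps=1e-12):
    """Average log-probabilities of two models, then renormalise to sum to 1."""
    avg_log = {
        label: 0.5 * (math.log(max(probs_1[label], eps)) + math.log(max(probs_2[label], eps)))
        for label in probs_1
    }
    unnormalised = {label: math.exp(value) for label, value in avg_log.items()}
    total = sum(unnormalised.values())
    return {label: value / total for label, value in unnormalised.items()}


# Hypothetical outputs of Model 1 and Model 2 for one paragraph.
model_1 = {"SDG 3": 0.6, "SDG 5": 0.3, "SDG 13": 0.1}
model_2 = {"SDG 3": 0.2, "SDG 5": 0.7, "SDG 13": 0.1}
combined = combine(model_1, model_2)
best = max(combined, key=combined.get)
print(best, combined)
```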

Results and evaluation. To evaluate the models, we used an out-of-sample dataset consisting of one-page summaries for 47 companies. Each summary was written by an RI expert, who assessed in a few lines how the company has an impact with respect to some identified key SDGs.

In our evaluation, we treat each summary as a paragraph. For each paragraph, we obtain a probability associated with each SDG. The higher the probability for a given SDG, the higher the chance that the paragraph’s topic is related to that SDG. For comparison purposes, we add a random model (i.e. one that attributes a probability of 1/17 to each of the 17 SDGs).

We then compute how often the primary SDG identified by the RI expert appears among the first ‘n’ predicted SDGs, where ‘n’ goes from one to four (Exhibit 5). For each line, we highlighted the best results.

Exhibit 5: Primary SDG in the n first labels predicted

First, we see that each model is much better than the random model, which is satisfactory. Second, we see that Model 2 is not the most efficient but provides decent results given its simplicity. Third, by combining Model 1 and Model 2, we achieve slight improvements that encourage us to continue our research and development work.
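The metric behind Exhibit 5 can be sketched as a top-n hit rate: the share of summaries whose expert-assigned primary SDG appears among the n highest-scoring predictions. The predictions and labels below are hypothetical and do not reproduce the exhibit. For the uniform random model, all 17 SDGs tie, so with random tie-breaking the expected hit rate is n/17.

```python
def top_n_hit_rate(predictions, primary_labels, n):
    """Share of samples whose primary label is among the n top-scored labels."""
    hits = 0
    for probs, primary in zip(predictions, primary_labels):
        ranked = sorted(probs, key=probs.get, reverse=True)
        if primary in ranked[:n]:
            hits += 1
    return hits / len(primary_labels)


# Hypothetical model outputs for three summaries, with three labels each.
predictions = [
    {"SDG 3": 0.50, "SDG 5": 0.30, "SDG 13": 0.20},
    {"SDG 3": 0.20, "SDG 5": 0.30, "SDG 13": 0.50},
    {"SDG 3": 0.40, "SDG 5": 0.35, "SDG 13": 0.25},
]
primary_labels = ["SDG 3", "SDG 5", "SDG 13"]
hit_rates = [top_n_hit_rate(predictions, primary_labels, n) for n in (1, 2, 3)]

# Expected top-n hit rate of a uniform random model over 17 SDGs.
random_baseline = [n / 17 for n in range(1, 5)]
print(hit_rates, random_baseline)
```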

Areas for improvement. At the data source level, we could extend our dataset with news related to companies. At the data structuring level, the process that extracts structured paragraphs from reports could be improved through better parsing and a better definition of how large a paragraph must be. This is not an easy task and must not be underestimated. At the learning phase, the training set and the validation set could be extended through broad use of a feedback mechanism.

Currently, a feedback loop (like/dislike tagging) has been implemented and is feeding the training set with real paragraphs extracted from companies’ reports. As the tool is used more widely, these examples will come to represent a significant part of the training set and will make the model more valuable. At the algorithm level, we think that exploring new methodologies such as Bidirectional Encoder Representations from Transformers (BERT) and mixing them with naïve models could enhance the quality of the labelling methodology.
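The like/dislike feedback loop mentioned above could feed the training set along the following lines. This is a sketch under assumptions, not a description of the implemented tool: the `Feedback` record, the `apply_feedback` function and the choice to set disliked pairs aside rather than reuse them are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    paragraph: str  # paragraph extracted from a company report
    label: str      # SDG predicted by the model for this paragraph
    liked: bool     # True for "like", False for "dislike"


def apply_feedback(training_set, feedback_items):
    """Add liked (paragraph, label) pairs to the training set.

    Disliked pairs are returned separately; how they are used is left open.
    """
    confirmed = [(f.paragraph, f.label) for f in feedback_items if f.liked]
    rejected = [(f.paragraph, f.label) for f in feedback_items if not f.liked]
    return training_set + confirmed, rejected


# Hypothetical starting set and analyst feedback.
wiki_training_set = [("A vaccine trains the immune system.", "SDG 3")]
feedback_items = [
    Feedback("The company funds rural clinics.", "SDG 3", liked=True),
    Feedback("Revenue grew by five percent.", "SDG 5", liked=False),
]
extended_set, rejected = apply_feedback(wiki_training_set, feedback_items)
print(len(extended_set), len(rejected))
```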

Integration into the SDG framework

A first – and concrete – test case of such a tool is to add model outputs into our proprietary SDG framework.

RI analysts will calibrate the model by supervising and validating the accuracy of outputs before integrating them into our AXA-IM RI quantitative framework.

Coverage. This combination of machine analysis and human judgment will save analysts time when analysing a specific company. It lets them navigate documents by SDG rather than reading them linearly, which matters because useful information is often found well beyond the first pages of a document. In addition, analysts will be able to screen more companies with the tool over the same period. As a global asset manager, the coverage of our investment universe is critical when adding a new dataset to an existing framework, and this tool could help us ramp up rapidly to the required coverage level.

New dimension. We believe the outputs will prompt analysts to reflect on the different SDGs that the algorithm has selected for an issuer, offering a less subjective view. We could also analyse the investment universe with an SDG approach, in addition to the classic sector and geographical classifications.

A new set of variables could emerge from these outputs, shaping the design of improved signals to identify potential new impact leaders and laggards. By applying our labelling algorithm to a much broader scope of companies than the one we currently cover, we could build a recommended list of companies to invest in as well.

Ownership. In addition, having a full understanding of the inputs used in the model will give us better insight into the outputs. A proprietary dataset increases our ability to explain scores and ratings to clients. We are also enlarging our proprietary dataset, which already exists for some specific asset classes such as green and social bonds, with new SDG information on companies. Whereas peer asset managers use the same datasets from the same set of external ESG data providers, our proprietary dataset gives us a competitive advantage in a cost-effective way, by avoiding additional market data costs.

Reactivity. The tool will help resolve some of the timeliness limitations we face with the current dataset. For instance, some ESG data points rely on annual information, creating some lag in an already backward-looking approach. We believe that by applying NLP to news feeds, we could be more reactive in our investment decisions and narrow the gap between backward-looking and real-time information.

AI as Augmented Intelligence. As an active asset manager, we place human expertise at the heart of every decision. As a stakeholder of the digital transformation, we believe in the responsible use of technology. Therefore, with this tool and other initiatives, we aim to develop augmented intelligence: a unique combination of technology, quantitative methods, financial and RI expertise.

This document is for informational purposes only and does not constitute investment research or financial analysis relating to transactions in financial instruments as per MIF Directive (2014/65/EU), nor does it constitute on the part of AXA Investment Managers or its affiliated companies an offer to buy or sell any investments, products or services, and should not be considered as solicitation or investment, legal or tax advice, a recommendation for an investment strategy or a personalized recommendation to buy or sell securities.

It has been established on the basis of data, projections, forecasts, anticipations and hypotheses which are subjective. Its analysis and conclusions are the expression of an opinion, based on available data at a specific date. All information in this document is established on data made public by official providers of economic and market statistics. AXA Investment Managers disclaims any and all liability relating to a decision based on or for reliance on this document. All exhibits included in this document, unless stated otherwise, are as of the publication date of this document. Furthermore, due to the subjective nature of these opinions and analyses, these data, projections, forecasts, anticipations, hypotheses, etc. are not necessarily used or followed by AXA IM’s portfolio management teams or its affiliates, who may act based on their own opinions. Any reproduction of this information, in whole or in part is, unless otherwise authorised by AXA IM, prohibited.

This document has been edited by AXA INVESTMENT MANAGERS SA, a company incorporated under the laws of France, having its registered office located at Tour Majunga, 6 place de la Pyramide, 92800 Puteaux, registered with the Nanterre Trade and Companies Register under number 393 051 826. In other jurisdictions, this document is issued by AXA Investment Managers SA’s affiliates in those countries.

In the UK, this document is intended exclusively for professional investors, as defined in Annex II to the Markets in Financial Instruments Directive 2014/65/EU (“MiFID”). Circulation must be restricted accordingly.

 

© AXA Investment Managers 2020. All rights reserved