One of the key values RavenPack offers its customers is the ability of our products to deliver relevant information in real time to support their decision-making.
Named Entity Recognition (NER) plays an important role in how RavenPack identifies the relevant aspects of a news story. Prior knowledge of the relevant commodity, company or sector, to name a few, is paramount for providing high-quality information in a timely manner. To accomplish this, we maintain a growing database of around 400,000 predefined entities across 16 distinct types.
What happens when new companies, like Stripe, or currencies, like Bitcoin, start appearing in the news? To keep up with the pace of the market, we wanted to create a tool based on Deep Learning that helps our teams detect new entities that are not yet in our database, so they can be considered for inclusion.
HuggingFace offers models pre-trained for the NER task, based on various architectures. They perform well without further tuning, as you can see in this example:
Aluminum prices have declined in recent months on concerns about the eurozone crisis and its implications for demand, with the London Metal Exchange (LME) three-month aluminum price down to $ 2,100 a tonne from its peak of $ 2,800 in May.
The models are trained using the English version of the standard CoNLL-2003 Named Entity Recognition dataset. They are capable of recognizing four types of entities:
- LOC: location.
- PER: person.
- ORG: organization.
- MISC: miscellaneous.
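To make these four types more concrete: models trained on CoNLL-2003 typically emit one tag per token in a BIO scheme (B-ORG, I-ORG, O, and so on), and entity mentions are recovered by grouping consecutive tags. The following is a minimal sketch of that grouping in plain Python. The tokens and tags are hand-written for illustration and are not real model output, and `group_entities` is a hypothetical helper, not a HuggingFace function.

```python
from typing import List, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level BIO tags into (entity_type, text) mentions."""
    entities: List[Tuple[str, str]] = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # "B-ORG" -> ("B", "-", "ORG"); "O" -> ("O", "", "")
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_type:
            # Continuation of the mention that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open mention, if there is one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # Start a new mention (a stray I- tag is treated like B-).
            current_type, current_tokens = label, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hand-written tags for a fragment of the example sentence (illustrative only).
tokens = ["the", "London", "Metal", "Exchange", "(", "LME", ")", "price"]
tags = ["O", "B-ORG", "I-ORG", "I-ORG", "O", "B-ORG", "O", "O"]
mentions = group_entities(tokens, tags)
print(mentions)
```

With these illustrative tags, the three tokens "London Metal Exchange" collapse into a single ORG mention, and "LME" becomes a second one.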
However, for us, this is not enough. You can see the difference between what the off-the-shelf models were able to identify and what our current system identifies:
At RavenPack we have a rich corpus of data and use 16 different entity types that matter for understanding news from a financial standpoint. As the non-exhaustive list below shows, some are similar to those of the pre-trained models, such as PEOP and PER, while others are more granular: we distinguish between COMP for companies and ORGT for organizations.
- PEOP: people.
- ORGT: organizations like colleges, NGOs, etc.
- COMP: companies.
- PRDT: products.
For these reasons, we are going to leverage HuggingFace and its pre-trained models, and fine-tune them for an NER task using a custom dataset.
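As a rough sketch of what changes first when moving from the four CoNLL-2003 types to a custom set: a token-classification head needs one output per tag, so the label list has to be rebuilt from the custom entity types. The snippet below assumes a BIO tagging scheme (the tagging scheme actually used is not stated here) and includes only the four types listed above, not the full set of 16. The helper name `build_label_maps` is hypothetical. The resulting `id2label` and `label2id` dictionaries are the kind of mapping that can be passed to `AutoModelForTokenClassification.from_pretrained` in the transformers library.

```python
from typing import Dict, List, Tuple

# Only the four custom types listed above; the full set has 16.
ENTITY_TYPES = ["PEOP", "ORGT", "COMP", "PRDT"]


def build_label_maps(entity_types: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Build id<->label maps for a BIO scheme (assumed, not confirmed)."""
    labels = ["O"]  # the "outside any entity" tag comes first
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")  # first token of a mention
        labels.append(f"I-{entity_type}")  # subsequent tokens of a mention
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


id2label, label2id = build_label_maps(ENTITY_TYPES)
print(len(id2label))
print(label2id["B-COMP"])
```

Under the same BIO assumption, the full set of 16 types would give 2 × 16 + 1 = 33 labels.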