Natural Language Processing
Spiketrap pushes the limits of automated natural language understanding. Our main differentiator comes from augmenting the latest NLP techniques with years of data focused on the language of gaming. Off-the-shelf tokenizers and taggers depend heavily on local linguistic features, such as capitalization and the POS/NER tags of neighboring words; as a result, they are well known to produce unsatisfactory results on error-prone, badly formatted, short texts such as tweets. Our NLP technologies are built in-house to overcome these challenges.
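To see why surface cues break down, consider a toy heuristic (not our pipeline, just an illustration of the kind of capitalization signal off-the-shelf taggers lean on):

```python
def naive_entities(text):
    """Toy NER heuristic: treat capitalized, non-sentence-initial words
    as entity candidates. Off-the-shelf taggers rely on similar surface
    cues, which is why they degrade on informal social text."""
    words = text.split()
    return [w for w in words[1:] if w[:1].isupper()]

# Well-formed prose: the capitalization cue works.
print(naive_entities("I just finished Anthem on my Switch"))  # ['Anthem', 'Switch']

# Lowercase, tweet-style text: the same heuristic finds nothing.
print(naive_entities("i just finished anthem on my switch"))  # []
```

Real taggers are far more sophisticated, but the failure mode on all-lowercase, loosely punctuated text is the same in kind.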
Spiketrap combines a structured database of media- and entertainment-related intellectual properties (companies, franchises, characters, movies, television shows, video games, DLCs, and more!) with a network of classifiers to identify which pieces of content are talking about which entities, even when those entities are never named explicitly in the text. Vanilla string matching simply does not work: everyday language is filled with ambiguity, abbreviations, and acronyms: titles like "See," "Anthem," "Dreams," and "Control"; companies such as "2K" or "Blizzard"; the "Switch" console; "WoW" for "World of Warcraft" or "PS" for "PlayStation." We have learned that passionate discourse rarely gives a free ride on grammar, capitalization, or quotation marks!
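A minimal sketch of the problem, with a made-up gazetteer and context list (the names and word lists here are illustrative assumptions, not our actual classifiers):

```python
# Hypothetical gazetteer of game titles that double as common words.
GAME_TITLES = {"anthem", "control", "dreams", "switch"}

# Hypothetical context cues suggesting the text is about gaming.
GAMING_CONTEXT = {"game", "remedy", "ps4", "xbox", "dlc", "play"}

def naive_mentions(text):
    """Vanilla string matching: every title word counts as a mention."""
    return [w for w in text.lower().split() if w in GAME_TITLES]

def contextual_mentions(text):
    """Sketch of disambiguation: accept title matches only when gaming
    context words co-occur in the same text."""
    words = text.lower().split()
    if not any(w in GAMING_CONTEXT for w in words):
        return []
    return [w for w in words if w in GAME_TITLES]

print(naive_mentions("i lost control of my car"))              # ['control'] -- false positive
print(contextual_mentions("i lost control of my car"))         # []
print(contextual_mentions("the new control game looks amazing"))  # ['control']
```

A real disambiguator uses learned classifiers over much richer signals, but the core idea (string match alone is not evidence; context decides) is the same.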
One of the hot topics of the last couple of decades, sentiment analysis is still going strong thanks to incremental state-of-the-art (SOTA) improvements from recurrent, convolutional, transformer, and ever more complex neural architectures. Too bad these improvements on the typical IMDb dataset barely make a dent when applied to the language of the internet! Our sentiment classifiers are trained on our large in-house labeled dataset, and vastly outperform the latest classifiers and off-the-shelf sentiment services when applied to the media and entertainment landscape. Beyond the classification itself, another core challenge is the extremely diverse set of data to process: think of a two-emoticon chat message versus a long article with dozens of paragraphs.
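The domain shift is easy to demonstrate with a toy lexicon-based scorer (the lexicons below are invented for illustration; our actual classifiers are learned models, not word lists):

```python
# Generic sentiment lexicon: "sick" and "insane" read as negative.
GENERIC_LEXICON = {"sick": -1, "insane": -1, "great": 1, "broken": -1}

# Gaming-aware lexicon: slang flips the polarity of "sick" and "insane".
GAMING_LEXICON = {"sick": 1, "insane": 1, "great": 1, "broken": -1}

def score(text, lexicon):
    """Sum word polarities -- a stand-in for a lexicon-based classifier."""
    return sum(lexicon.get(w, 0) for w in text.lower().split())

msg = "that combo was sick"
print(score(msg, GENERIC_LEXICON))  # -1: generic lexicon reads praise as negative
print(score(msg, GAMING_LEXICON))   #  1: domain knowledge recovers the intent
```

The same kind of gap appears for models trained on movie reviews and applied to gamer chat, which is why in-domain labeled data matters so much.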
Call them topics, conversations, or trends: our main area of research, besides sentiment, revolves around organic data segmentation. To deliver actionable insights and let you focus on what's important, we invest our energy in automating topic discovery. LDA and other standard probabilistic topic models do not perform well on short texts, and the issue is exacerbated when confronted with a corpus of documents of dissimilar lengths. In addition, parameter estimation for common topic models (usually Gibbs sampling or variational inference) does not scale well to large datasets.
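One way to see why LDA struggles on short texts: topic models learn from word co-occurrence statistics, and a tweet supplies almost none. A toy count of distinct co-occurring word pairs (the example documents are invented):

```python
from itertools import combinations

def cooccurrence_pairs(doc):
    """Distinct within-document word pairs -- the raw co-occurrence
    evidence that probabilistic topic models like LDA learn from."""
    words = sorted(set(doc.lower().split()))
    return set(combinations(words, 2))

tweet = "anthem launch rough"
paragraph = "anthem launch rough servers patch loot mission story"

print(len(cooccurrence_pairs(tweet)))      # 3 pairs -- very sparse evidence
print(len(cooccurrence_pairs(paragraph)))  # 28 pairs from just 8 distinct words
```

Pair counts grow quadratically with document length, so a corpus mixing tweets with long articles gives the model wildly uneven evidence per document.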
To make topic discovery even more challenging, consider that even once you can identify meaningful topics within a specific window of time, you still need to relate them to each other over the course of weeks, months, and years. Our proprietary methodology addresses all of these challenges, providing you with meaningful, coherent conversations over time.
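A naive baseline for this cross-window linking step (not our methodology, just the simplest reasonable sketch) is to match topics between adjacent time windows by the similarity of their word distributions; the topic descriptors below are invented:

```python
import math

def cosine(a, b):
    """Cosine similarity between two {word: weight} topic descriptors."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# Hypothetical topics discovered in two consecutive weeks.
week1_topic = {"anthem": 0.5, "loot": 0.3, "servers": 0.2}
week2_topic = {"anthem": 0.4, "loot": 0.4, "patch": 0.2}
week2_other = {"switch": 0.6, "port": 0.4}

# The first week-2 topic is plausibly the same conversation continuing.
print(cosine(week1_topic, week2_topic) > cosine(week1_topic, week2_other))  # True
```

This baseline breaks down as vocabulary drifts over months, which is part of why linking conversations across long horizons is a hard problem in its own right.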