The Challenge of Influencer Selection
Advertisers who promote products and services through influencer posts face a significant problem called targeting. In advertising platforms like AdWords, advertisers can focus on specific keywords to narrow down their target audience, ensuring that it matches the products and services they’re promoting. However, this capability doesn’t exist on social networks like Instagram, where millions of potential influencers can promote ads. How can an advertiser choose influencers whose content aligns with their campaign?
One option is for the advertiser to actively use Instagram and be familiar with popular influencers they are already following. Alternatively, they could advertise blindly, hoping to reach at least a small percentage of their target audience. Clearly, both options are suboptimal, making Instagram an inefficient choice for niche, medium, and small advertisers.
Solving this problem would make Instagram advertising accessible and convenient for everyone, not just major brands.
Thematic Classifier? No, Thank You.
The most obvious way to categorize influencers into thematic segments is through a thematic classifier. First, a tree of topics is manually created. Then, each influencer is manually or automatically (using machine learning) assigned to categories within this tree.
Most advertising systems, including large ones like Facebook Ads, follow this approach. However, this method is inefficient. The number of topics potentially interesting to advertisers is vast, and one could even argue that it is infinite. The more specialized an advertiser’s niche, the more granular the classification they require.
- For example, there is a topic called Food.
- Clearly, a restaurant needs not just Food, but Food : Restaurants.
- If it’s a Thai restaurant, an even narrower topic is needed: Food : Restaurants : Thai Cuisine.
- If the restaurant is located in London, it would be beneficial to include a topic like Food : Restaurants : London, and so on.
However, a thematic tree cannot grow indefinitely:
- It would simply become unwieldy for advertisers to work with. The size of a classifier that an average person can easily comprehend and retain in their memory is no more than 100 items.
- The more categories there are, the more difficult it becomes to assign influencers to specific categories, and the more work is needed for this task. Manually doing this work for several million influencers is virtually impossible. Machine learning can be used, but an initial training dataset with examples of correct influencer classification must be created. Humans create these datasets, and humans make mistakes and judge subjectively (different assessors may classify the same influencer into different categories). Considering that many influencers work in multiple topics simultaneously, the problem becomes even more complex.
- A lot of effort will be required to maintain the classifier in an up-to-date state (new trends constantly emerge, new topics open, and old ones fade), as well as to update the training dataset.
As a result, a small thematic tree would be too inflexible and provide too coarse a division, while a large tree would be cumbersome for both advertisers and its creators. A different solution is needed.
Advertisers Define the Topic
What if the topic could be defined by a set of keywords, similar to AdWords? This way, advertisers can customize any topic, no matter how narrow, without having to navigate a massive classifier. However, in AdWords, keywords correspond to what potential customers explicitly search for, but what should keywords correspond to on Instagram? Hashtags? It’s not that simple.
- An influencer writing about cars isn’t necessarily going to use the hashtag #car; they might use #auto, #fastcars, #wheels, #drive, or brand names like #bmw, #audi, etc. How should an advertiser guess which specific tags influencers use? In principle, this issue also exists in AdWords: advertisers need to put in considerable effort to cover all possible keywords in their niche.
- An influencer might use a hashtag, like #car, accidentally if they bought a new car or simply saw and photographed an interesting vehicle on the street. This doesn’t mean they write about automotive topics.
- Influencers often use popular hashtags unrelated to their post’s theme just to appear in search results for that tag (hashtag spam). For example, the tag #cat could be attached to a photo of a bearded hipster, a sunset landscape, or a selfie in a new outfit surrounded by friends.
Therefore, selecting influencers based solely on the presence of advertiser-defined hashtags will yield poor results. More sophisticated methods are needed to address this challenge.
Topic Modeling: The Theory
In modern natural language processing techniques, there is a field called topic modeling. The simplest way to explain the application of topic modeling to our problem is with a straightforward example.
Imagine a very primitive social network where people have only two main interests - an interest in food (Food) and an interest in Japan (Japan). If the “strength” of interest is represented by a number between 0 and 1, then any hashtags used by influencers can be placed on a 2D diagram.
As seen, any hashtag can be described by a pair of numbers corresponding to the X and Y coordinates in the topic space. Tags related to a single theme group together in clusters, meaning they have similar coordinates. Using this diagram, the relevance of a post to specific topics can be calculated by determining the “average” coordinates in the topic space for all tags included in the post (i.e., obtaining the centroid). The centroid’s X and Y coordinates will correspond to the post’s relevance to the Food and Japan topics; the closer the coordinate is to 1, the higher the relevance. By calculating the centroid for all of an influencer’s posts in the same way, the overall relevance of their content to specific topics can be understood.
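The centroid calculation described above can be sketched in a few lines. The tag coordinates below are made-up values for the two-topic toy network, and the function names are hypothetical; this is an illustration of the idea, not the project's code.

```python
from statistics import fmean

# Hypothetical (food, japan) coordinates for a few tags, for illustration only.
TAG_COORDS = {
    "sushi": (0.9, 0.9),
    "pizza": (0.9, 0.1),
    "tokyo": (0.1, 0.9),
    "ramen": (0.8, 0.8),
}


def centroid(tags, coords=TAG_COORDS):
    """Average the topic coordinates of the known tags in one post."""
    points = [coords[t] for t in tags if t in coords]
    if not points:
        return None  # no known tags, so relevance is undefined
    return tuple(fmean(axis) for axis in zip(*points))


def influencer_relevance(posts, coords=TAG_COORDS):
    """Average the centroids of all posts that contain at least one known tag."""
    centroids = [c for c in (centroid(p, coords) for p in posts) if c is not None]
    if not centroids:
        return None
    return tuple(fmean(axis) for axis in zip(*centroids))


post_relevance = centroid(["sushi", "tokyo"])  # (food, japan) relevance of one post
```

Each coordinate of the result reads directly as relevance to one topic, which is what makes this representation easy to interpret.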
In real-world topic modeling, of course, not just two topics are used, but rather dozens or even hundreds. Consequently, tags exist in a high-dimensional space. A more mathematical definition is as follows:
- There is a set of documents $D$ (in our case, these are posts), a set of words $W$ (in our case, these are tags), and a set of topics $T$, the size of which is predetermined.
- The content of the documents can be represented as a set of document-word pairs: $(d, w), d \in D, w \in W_d$
- Each topic $t \in T$ is described by an unknown distribution $p(w|t)$ over the set of words $w \in W$
- Each document $d\in D$ is described by an unknown distribution $p(t|d)$ over the set of topics $t\in T$
- It is assumed that the distribution of words in documents depends only on the topic: $p(w|t,d)=p(w|t)$
- During the construction of the topic model, the algorithm finds the “word-topic” matrix $\mathbf{\Phi} =||p(w|t)||$ and the “topic-document” matrix $\mathbf{\Theta} =||p(t|d)||$ based on the content of the collection $D$. We are interested in the first matrix.
Topic modeling is equivalent to non-negative matrix factorization (NMF). The input is a sparse “word-document” matrix $\mathbf{S} \in \mathbb{R}^{|W| \times |D|}$, whose entries give the probability of encountering word $w$ in document $d$. It is approximated by the product of two low-rank matrices, $\mathbf{\Phi} \in \mathbb{R}^{|W| \times |T|}$ and $\mathbf{\Theta} \in \mathbb{R}^{|T| \times |D|}$: $$\mathbf{S} \approx \mathbf{\Phi}\mathbf{\Theta}$$ More detailed information on topic modeling and its algorithms can be found on Wikipedia.
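Written element-wise, and using the assumption that the distribution of words depends only on the topic, the factorization is simply the law of total probability over topics: $$p(w|d)=\sum_{t\in T}p(w|t,d)\,p(t|d)=\sum_{t\in T}p(w|t)\,p(t|d)=\sum_{t\in T}\Phi_{wt}\Theta_{td}$$ Each entry of $\mathbf{S}$ is therefore approximated by mixing the word distributions of all topics in proportion to how strongly the document expresses each topic.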
Topic Modeling: The Practice
In practice, topic modeling has shown mediocre results. The table below presents the modeling results for 15 topics using the BigARTM library:
Topic | Top tags |
---|---|
0 | sky, clouds, sea, spring, baby, ocean, nyc, flower, landscape, drinks |
1 | beer, vintage, chill, school, rainbow, yoga, rock, evening, chicago, relaxing |
2 | sweet, chocolate, dance, rain, nike, natural, anime, old, wcw, reflection |
3 | foodporn, breakfast, delicious, foodie, handmade, gold, instafood, garden, healthy, vegan |
4 | architecture, california, lights, portrait, newyork, wine, blonde, familytime, losangeles, thanksgiving |
5 | nature, travel, autumn, london, fall, trees, tree, photoshoot, city, cake |
6 | flowers, design, inspiration, artist, goals, illustration, pizza, ink, glasses, money |
7 | winter, snow, catsofinstagram, sexy, cats, cold, quote, fire, disney, festival |
8 | work, mountains, paris, football, nails, video, florida, diy, free, japan |
9 | dog, puppy, wedding, dogsofinstagram, dogs, roadtrip, painting, trip, thankful, pet |
10 | coffee, quotes, river, yum, moon, streetart, sleepy, music, adidas, positivevibes |
11 | style, fashion, party, home, model, music, dress, goodvibes, couple, tired |
12 | fitness, motivation, gym, workout, drawing, dinner, fit, sketch, health, fresh |
13 | beach, lake, usa, shopping, hiking, fashion, kids, park, freedom, sand |
14 | makeup, cat, yummy, eyes, snapchat, homemade, tattoo, kitty, lips, mom |
It can be seen that some reasonable structure is discernible, but the topics are far from perfect. Increasing the number of topics to 150 results in a relatively small improvement.
Perhaps the reason is that topic modeling is designed to work with documents containing hundreds or even thousands of words. In our case, the majority of posts have only 2-3 tags.
BigARTM has a large number of hyperparameters and possible ways to apply them (at the beginning of training, at the end, to all topics, to individual topics, etc.). It is possible that better results could be obtained with certain settings, but TopicTensor is a commercial project that implies time limits for implementation. With topic modeling, there was a risk of spending all the project time on hyperparameter tuning and still not achieving satisfactory results. Other libraries (Gensim, Mallet) also showed rather modest results.
Therefore, a different, simpler, and at the same time more powerful modeling method was chosen. $ \newcommand{\sim}[2]{\operatorname{sim}(#1,#2)} $
TopicTensor Model
The main advantage of topic modeling is the interpretability of the obtained results. For any word/tag, a set of weights is generated, showing how close the word is to each topic in the entire set.
However, this advantage also imposes serious limitations on the model, forcing it to strictly adhere to a fixed number of topics, no more and no less. In reality, the number of topics in a large social network is virtually infinite. Therefore, by removing the requirement for interpretability of topics (and their fixed quantity), the training process becomes more efficient.
As a result, a model is obtained that is conceptually similar to the well-known Word2Vec model. Each tag is represented as a vector in an $N$-dimensional space: $w \in \mathbb{R}^N$. The degree of similarity (i.e., how close the topics are) between tags $w$ and $w'$ can be calculated as the dot product: $$\sim{w}{w'}=w \cdot w'$$ as the Euclidean distance, negated so that larger values mean closer tags: $$\sim{w}{w'}=-\|w-w'\|$$ or as cosine similarity: $$\sim{w}{w'}=\cos(\theta )=\frac{w \cdot w'}{\|w \|\|w' \|}$$
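As a minimal sketch, the three measures can be written with the standard library alone. The function names and toy vectors are illustrative. Note that the Euclidean function returns a plain distance, so smaller values mean closer tags, whereas for the dot product and cosine larger values mean closer tags.

```python
import math


def dot(u, v):
    """Dot product: grows with both alignment and vector length."""
    return sum(a * b for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Straight-line distance between the two vectors (0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """Cosine of the angle between the vectors; ignores vector length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


a, b, c = (1.0, 0.0), (2.0, 0.0), (0.0, 1.0)
same_direction = cosine(a, b)   # a and b differ only in length
orthogonal = cosine(a, c)       # unrelated directions
```

Cosine similarity discards vector length, which is useful when length mostly reflects how often a tag was seen during training rather than what it means.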
The model’s task during training is to find tag representations that will be useful for one of the following predictions:
- Based on one tag, predict which other tags will be included in the post (Skip-gram architecture)
- Based on all the post’s tags except one, predict the missing tag (CBOW architecture, “bag of words”)
- Take two random tags from the post, and predict the second tag based on the first
All these predictions boil down to the fact that there is a target tag $w_t$ that needs to be predicted, and a context $c$, represented by one or more tags included in the post. The model must maximize the probability of the tag depending on the context, which can be represented as a softmax criterion:
$$P(w_t|c) = \operatorname{softmax}(\sim{w_t}{c})$$ $$P(w_t|c) = \frac{\exp(\sim{w_t}{c})}{\sum_{w' \in W}\exp(\sim{w'}{c})}$$
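To make the criterion concrete, here is a small sketch that turns similarity scores into probabilities. The scores are invented, and `softmax_probs` is a hypothetical helper name.

```python
import math


def softmax_probs(scores):
    """Map {tag: sim(tag, context)} to {tag: P(tag | context)}."""
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {tag: math.exp(s - top) for tag, s in scores.items()}
    total = sum(exps.values())  # this sum runs over the whole vocabulary
    return {tag: e / total for tag, e in exps.items()}


# Invented similarities between three candidate tags and one context.
probs = softmax_probs({"sushi": 2.0, "ramen": 1.0, "car": -1.0})
```

The normalizing sum touches every tag in the vocabulary, which is the cost that becomes prohibitive with a million tags.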
However, calculating softmax over the entire set of tags $W$ is expensive (the training can involve a million tags or more), so alternative methods are used instead. They involve a positive example $(w_t,c)$ that needs to be predicted, and randomly selected negative examples $(w_1^{-}, c), (w_2^{-}, c),\dots,(w_n^{-}, c)$, which serve as a sample of how not to predict. Negative examples should be sampled from the same tag frequency distribution as in the training data.
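Drawing negatives in proportion to tag frequency could look like the sketch below. The tag counts are invented, and excluding the target tag from the draw is an assumption of this example rather than something the text specifies.

```python
import random


def sample_negatives(tag_counts, target, n, rng=random):
    """Draw n negative tags with probability proportional to their frequency."""
    tags = [t for t in tag_counts if t != target]  # assumption: never draw the target
    weights = [tag_counts[t] for t in tags]
    return rng.choices(tags, weights=weights, k=n)


# Invented corpus frequencies.
counts = {"love": 1000, "cat": 300, "bmw": 50, "sushi": 20}
negatives = sample_negatives(counts, target="bmw", n=5, rng=random.Random(0))
```

Frequent tags are drawn more often, so the model is regularly pushed away from popular but uninformative tags.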
The loss function for a set of examples can take the form of binary classification (negative sampling in classic Word2Vec): $$L = -\log(\sigma(\sim{w_t}{c})) - \sum_i\log(\sigma(-\sim{w_i^-}{c}))$$ $$\sigma(x) = \frac{1}{1+e^{-x}}$$ or work as a ranking loss, pairwise comparing the “compatibility” with the context of positive and negative examples: $$L = \sum_{i} l(\sim{w_t}{c}, \sim{w_i^-}{c})$$ where $l(\cdot, \cdot)$ is a ranking function, often the max margin loss with margin $\mu$: $$l=\max(0,\mu+\sim{w_i^-}{c}-\sim{w_t}{c})$$
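Both losses can be sketched directly on precomputed similarity values. The first function returns the negative log-likelihood of the binary classification, so smaller values are better for both functions. The margin of 0.2 is an arbitrary illustrative value.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def negative_sampling_loss(pos_sim, neg_sims):
    """Negative log-likelihood: the positive pair should score high, negatives low."""
    log_likelihood = math.log(sigmoid(pos_sim))
    log_likelihood += sum(math.log(sigmoid(-s)) for s in neg_sims)
    return -log_likelihood


def max_margin_loss(pos_sim, neg_sims, margin=0.2):
    """Penalize every negative that is not at least `margin` below the positive."""
    return sum(max(0.0, margin + s - pos_sim) for s in neg_sims)


# The positive pair scores 0.9; one negative is far away, one is too close.
ranking_loss = max_margin_loss(0.9, [0.1, 0.8], margin=0.2)
```

With the ranking loss, negatives that are already separated by more than the margin contribute nothing, so training effort concentrates on the hard cases.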
The TopicTensor model is also equivalent to matrix factorization, but instead of a “document-word” matrix (as in topic modeling), a “context-tag” matrix is factorized here. Under certain types of predictions, this matrix turns into a “tag-tag” co-occurrence matrix.
Practical Implementation of TopicTensor
Several possible ways of implementing the model were considered: code in Tensorflow, code in PyTorch, the Gensim library, and the StarSpace library. The latter was chosen as it requires minimal effort for modification (all necessary functionality is already available), provides high quality, and scales almost linearly on any number of cores (32 and 64-core machines were used to speed up training).
By default, StarSpace uses the max margin ranking loss function and cosine similarity as the vector similarity metric. Subsequent experiments with hyperparameters showed that these default settings gave the best results.
Hyperparameter Tuning
Before the final training, hyperparameter tuning was conducted to find a balance between quality and acceptable training time. The quality was measured as follows: a sample of posts that the model had not seen during training was taken. For each tag $w_i$ in the post (a total of $n$ tags in the post), the $n$ most similar candidate tags were found according to the cosine similarity criterion from the set of all tags $W$: $$candidates_i=\operatorname{top}_n(\sim{w_i}{w'}, \forall w' \in W), \quad i \in 1 \dots n$$ The number of these candidates that matched the actual tags in the post was calculated (number of matches $n^{+}$). $$quality=\frac{\sum n^+}{\sum n}$$ The quality is the fraction of correctly guessed tags over all posts in the sample. This quality assessment is the closest to the real-life use of the model, where a user will mostly provide one starting tag and select other tags, bloggers, etc., based on it.
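The following sketch shows one possible reading of this metric on toy data. The text does not say whether the query tag itself counts as a candidate. This sketch excludes it and normalizes by the $n-1$ remaining tags so that a perfect model scores 1, which is an assumption. The vectors and helper names are invented.

```python
import heapq
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_similar(tag, vectors, n):
    """The n tags closest to `tag` by cosine similarity, excluding `tag` itself."""
    query = vectors[tag]
    others = (t for t in vectors if t != tag)
    return heapq.nlargest(n, others, key=lambda t: cosine(query, vectors[t]))


def quality(posts, vectors):
    """Fraction of a post's other tags recovered from each single tag."""
    matches = total = 0
    for post in posts:
        tags = {t for t in post if t in vectors}
        for tag in tags:
            others = tags - {tag}
            candidates = top_similar(tag, vectors, len(others))
            matches += len(others.intersection(candidates))
            total += len(others)
    return matches / total if total else 0.0


# Invented 2D embeddings: two car tags and two cat tags.
VECTORS = {"car": (1.0, 0.0), "auto": (0.9, 0.1), "cat": (0.0, 1.0), "kitty": (0.1, 0.9)}
coherent = quality([["car", "auto"], ["cat", "kitty"]], VECTORS)
mixed = quality([["car", "cat"]], VECTORS)
```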
This assessment also implies that the model is best trained using the skip-gram criterion (predicting the remaining tags from a single tag). This was confirmed in practice: skip-gram training showed the best quality, although it turned out to be the slowest.
The following hyperparameters were tuned:
- Vector dimensions
- Number of epochs for training
- Number of negative examples
- Learning rate
- Undersampling and oversampling
The last hyperparameter is related to the fact that the model learns posts with few tags faster than posts with many tags. In a single pass, StarSpace randomly selects only one target tag from a post. Thus, over 20 epochs, each tag in a post containing 2 tags will be the target, on average, 10 times, while each tag in a post containing 20 tags will be the target, on average, only once. The model will overfit on short tag lists and underfit on long ones. To avoid this, undersampling should be applied to “short” posts and oversampling to “long” posts.
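The arithmetic behind this imbalance, and one hypothetical way to compensate for it, can be sketched as follows. The resampling rule and the reference length of 5 tags are assumptions made for illustration; the text does not describe the exact scheme that was used.

```python
def expected_target_count(epochs, n_tags, repeat=1.0):
    """Expected number of times each tag of a post is drawn as the target.

    One target tag is drawn per post per pass, so each tag has a 1/n_tags chance.
    `repeat` is how many times the post appears per epoch after resampling.
    """
    return epochs * repeat / n_tags


def repeat_factor(n_tags, reference_len=5):
    """Hypothetical resampling weight: below 1 undersamples, above 1 oversamples."""
    return n_tags / reference_len


short_raw = expected_target_count(20, 2)    # 10 times per tag
long_raw = expected_target_count(20, 20)    # once per tag

# After resampling, tags from short and long posts are seen equally often.
short_balanced = expected_target_count(20, 2, repeat_factor(2))
long_balanced = expected_target_count(20, 20, repeat_factor(20))
```

A fractional factor such as 0.4 can be read as the probability of keeping the post in a given epoch, while a factor of 4 means the post is repeated four times.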
Data Preparation
Tags were normalized: converted to lowercase, diacritical marks removed (except for cases where the mark affects the meaning of the word).
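A normalization step of this kind might look like the sketch below. The set of protected characters is hypothetical: which marks change the meaning of a word depends on the language, and the text does not list them.

```python
import unicodedata


def normalize_tag(tag, keep=frozenset("ñй")):
    """Lowercase a tag and strip diacritics, except for characters in `keep`.

    `keep` is a hypothetical per-language set of letters whose marks change
    the meaning of the word and therefore must survive normalization.
    """
    text = unicodedata.normalize("NFC", tag.lstrip("#")).lower()
    result = []
    for ch in text:
        if ch in keep:
            result.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        # Drop combining marks (accents), keep the base character.
        result.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(result)


normalized = [normalize_tag(t) for t in ("#Café", "España", "BMW")]
```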
For training, tags that appeared in the training set at least N times across different bloggers were selected to ensure a variety of contexts for their use (depending on the language, N varied from 20 to 500).
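One reading of this filter is to count the distinct bloggers who used each tag and keep the tags that reach the threshold. The sketch below follows that reading, and the data layout (pairs of blogger id and tag list) is an assumption.

```python
from collections import defaultdict


def frequent_tags(posts, min_bloggers):
    """Keep tags used by at least `min_bloggers` distinct bloggers.

    `posts` is an iterable of (blogger_id, tags) pairs.
    """
    bloggers_by_tag = defaultdict(set)
    for blogger_id, tags in posts:
        for tag in tags:
            bloggers_by_tag[tag].add(blogger_id)
    return {tag for tag, bloggers in bloggers_by_tag.items() if len(bloggers) >= min_bloggers}


posts = [
    ("u1", ["car", "bmw"]),
    ("u2", ["car"]),
    ("u1", ["bmw", "bmw"]),  # repeated use by the same blogger does not count
]
vocabulary = frequent_tags(posts, min_bloggers=2)
```

Counting bloggers rather than raw occurrences keeps a single prolific account from pushing its private tags into the vocabulary.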
For each language, a sample of the top 1,000 most common tags was created, and in this sample, a blacklist was created for commonly used words that do not carry thematic weight (e.g., me, you, together, etc.), numerals, color names (red, yellow, etc.), and some tags particularly favored by spammers.
Each blogger’s tags were reweighted according to their frequency of use by that blogger. Most bloggers have “favorite” tags and combinations used in almost every post, and if their weight is not reduced, an actively writing blogger can skew global statistics on the use of their favorite tags, causing the model to learn the preferences of that particular blogger.
The final training set consisted of approximately 8 billion tags from 1 billion posts. Training took more than three weeks on a 32-core server.
Results
The obtained embeddings showed excellent separation of topics, good generalization ability, and resistance to spam tags.
A demo sample of the top 10K tags (English language only) is available for viewing in the Embedding Projector. Following the link, switch to t-SNE mode (tab in the lower-left corner) and wait for approximately 500 iterations to build the 3D projection. It is best to view in Color By = logcnt mode. If you don’t want to wait, in the lower-right corner, there is a Bookmarks section; select Default to load a pre-calculated projection immediately.
Examples of Topic Formation
Let’s start with the simplest case. We will define a topic with one tag and find the top 50 relevant tags.
Tags are colored according to relevance. The size of the tag is proportional to its popularity.
As seen, TopicTensor has excellently formed the ‘BMW’ topic and found many relevant tags that most people are not even aware of.
Let’s make the task more complex and form a topic from several German car brands (finding tags that are closest to the sum of input tag vectors):
In this example, we can see TopicTensor’s ability to generalize: TopicTensor understood that we are referring to cars in general (tags #car, #cars). It also recognized the preference for German cars in the topic (tags outlined in red) and added the “missing” tags: #porsche (also a German car brand), and alternative tag spellings not present in the input: #mercedesbenz, #benz, and #volkswagen.
Let’s make the task even more complex and create a topic based on the ambiguous tag #apple, which can represent both a brand and a simple fruit. It is evident that the brand topic dominates, but the fruit theme is also present in the form of tags #fruit, #apples, and #pear.
Let’s try to extract a purely “fruit” theme by adding several tags related to the Apple brand with negative weights. We will look for tags closest to the weighted sum of input tag vectors (by default, the weight is equal to one): $$target = \sum_i w_i \cdot tag_i $$
As seen, the negative weights removed the brand theme, leaving only the fruit theme.
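The effect of negative weights can be reproduced on a toy example. The two-dimensional vectors below (roughly a “tech” axis and a “fruit” axis) are invented for illustration and are not taken from the trained model.

```python
import heapq
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def weighted_sum(weights, vectors):
    """target = sum_i w_i * tag_i for a {tag: weight} mapping."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for tag, weight in weights.items():
        for k, x in enumerate(vectors[tag]):
            target[k] += weight * x
    return target


def topic_tags(weights, vectors, n):
    """The n tags closest to the weighted sum, excluding the input tags."""
    target = weighted_sum(weights, vectors)
    candidates = (t for t in vectors if t not in weights)
    return heapq.nlargest(n, candidates, key=lambda t: cosine(target, vectors[t]))


# Invented (tech, fruit) embeddings.
VECTORS = {
    "apple": (0.7, 0.6),
    "iphone": (1.0, 0.0),
    "mac": (0.9, 0.1),
    "fruit": (0.0, 1.0),
    "pear": (0.1, 0.9),
}
brand_heavy = topic_tags({"apple": 1.0}, VECTORS, n=2)
fruit_only = topic_tags({"apple": 1.0, "iphone": -1.0}, VECTORS, n=2)
```

Subtracting the brand vector rotates the target away from the tech axis, so only the fruit-related neighbors remain at the top.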
TopicTensor is aware that the same concept can be expressed by different words in different languages, as demonstrated with the #mirror example. Along with the English mirror and reflection, the model selected: зеркало and отражение in Russian, espejo and reflejo in Spanish, espelho and reflexo in Portuguese, specchio and riflesso in Italian, and spiegel and spiegelung in German.
In the last example, it can be seen that casual topics work just as well as brand-related ones.
Influencer Selection
For each influencer, their posts are analyzed, and the vectors of all the tags within them are summed up. $$\beta=\sum_{i=1}^{|posts|}\sum_{j=1}^{|tags_i|} w_{ij}$$ where $|posts|$ is the number of posts, $|tags_i|$ is the number of tags in the $i$-th post, and $w_{ij}$ is the vector of the $j$-th tag in the $i$-th post. The resulting vector $\beta$ represents the influencer’s theme. Next, influencers with a thematic vector closest to the user-defined thematic vector are identified. The list is sorted by relevance and presented to the user.
The influencer’s popularity and the number of tags in their posts are also taken into account, as otherwise, influencers with just one post containing one user-defined tag would top the list. The final score used to sort influencers is calculated as follows: $$score_i = {\sim{input}{\beta_i} \over \log(likes)^\lambda \cdot \log(followers)^\phi \cdot \log(tags)^\tau}$$ where $\lambda, \phi, \tau$ are empirically selected coefficients within the range $0\dots1$
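A sketch of both steps follows. It evaluates the score formula exactly as printed. The text does not state whether the numerator is a similarity or the cosine distance mentioned later, so the sort direction is left to the caller. The default exponents of 0.5 are placeholders, not the empirically selected values.

```python
import math


def influencer_vector(posts, vectors):
    """beta: the sum of the vectors of every known tag in every post."""
    dim = len(next(iter(vectors.values())))
    beta = [0.0] * dim
    for tags in posts:
        for tag in tags:
            if tag in vectors:
                for k, x in enumerate(vectors[tag]):
                    beta[k] += x
    return beta


def score(closeness, likes, followers, tags, lam=0.5, phi=0.5, tau=0.5):
    """closeness / (log(likes)^lam * log(followers)^phi * log(tags)^tau).

    `closeness` is the value of sim(input, beta). Counts must exceed 1,
    because log(1) = 0 would make the denominator zero.
    """
    if min(likes, followers, tags) <= 1:
        raise ValueError("likes, followers and tags must all be greater than 1")
    denominator = math.log(likes) ** lam * math.log(followers) ** phi * math.log(tags) ** tau
    return closeness / denominator


beta = influencer_vector([["a", "b"], ["a"]], {"a": (1.0, 0.0), "b": (0.0, 1.0)})
```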
Calculating cosine distance for the entire array of influencers (involving several million accounts) is time-consuming. To speed up the selection process, the NMSLIB (Non-Metric Space Library) was employed, reducing the search time by an order of magnitude. NMSLIB pre-builds indices based on vector coordinates in space, enabling much faster computation of top similar vectors by calculating cosine distance only for relevant candidates.
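A minimal sketch of such an index with the NMSLIB Python bindings is shown below. The text does not say which index type or parameters were used; the HNSW method and the values of `M` and `efConstruction` here are assumptions chosen for illustration, and the function names are hypothetical.

```python
def build_index(vectors):
    """Build an approximate nearest-neighbor index over influencer vectors.

    `vectors` is expected to be a 2D float32 NumPy array, one row per influencer.
    """
    import nmslib  # third-party package: pip install nmslib

    index = nmslib.init(method="hnsw", space="cosinesimil")
    index.addDataPointBatch(vectors)
    # Illustrative build parameters, not tuned settings.
    index.createIndex({"M": 16, "efConstruction": 200})
    return index


def nearest_influencers(index, query, k=10):
    """Return (ids, distances) of the k rows closest to `query`.

    In the 'cosinesimil' space the distance is 1 - cosine similarity,
    so smaller values mean closer themes.
    """
    ids, distances = index.knnQuery(query, k=k)
    return ids, distances
```

The returned ids are row numbers in the array passed to `build_index`, so a separate list is needed to map them back to influencer accounts.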
Demo Website
A demonstration website with a limited number of tags and bloggers is available at http://tt-demo.suilin.ru/. On the site, you can experiment with topic formation and selecting influencers based on the generated themes.
Topic Lookalikes
The $\beta$ vectors calculated for influencer selection can also be used to compare influencers with each other. Essentially, lookalikes are the same as influencer selection, but instead of a tag vector, the input is the topic vector $\beta$ of a user-defined influencer. The output is a list of influencers whose themes are close to the topic of the specified influencer, sorted by relevance.
Fixed Topics
As mentioned earlier, TopicTensor does not have explicitly defined topics. However, sometimes it is necessary to associate posts and influencers with a fixed set of topics for simplifying search or ranking influencers within separate topics. This gives rise to the task of extracting fixed topics from the tag vector space.
To address this issue, unsupervised learning was chosen to avoid subjectivity in defining possible topics and to save resources, as manually reviewing hundreds of thousands of tags (even just 10% of them) and assigning topics to them is a labor-intensive task.
The most obvious method for topic extraction is clustering the vector representation of tags, with one cluster representing one topic. Clustering was performed in two stages since no algorithms capable of efficiently searching for clusters in 200D space currently exist.
In the first stage, dimensionality reduction was conducted using the UMAP technology. In some sense, UMAP is an enhanced version of t-SNE (although based on entirely different principles), as it operates faster and better preserves the original topology of the data. The dimensionality was reduced to 5D, with cosine distance used as the distance metric, and the other hyperparameters were chosen based on the results of clustering (the second stage).
In the second stage, clustering was performed using the HDBSCAN algorithm. The clustering results (for English language only) can be viewed on GitHub. Clustering identified about 500 topics (the number of topics can be adjusted within wide limits using UMAP and clustering parameters), with 70%-80% of tags included in the clusters. Visual inspection revealed good thematic coherence and no noticeable correlation between clusters. However, for practical application, the clusters require refinement: assembling them into a tree, removing useless clusters (e.g., personal names cluster, negative emotions cluster, common words cluster), and combining some clusters into a single topic.
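The two stages can be sketched with the `umap-learn` and `hdbscan` packages as follows. Only the 5D target dimensionality and the cosine metric come from the text; `min_cluster_size` and the function names are illustrative assumptions.

```python
from collections import defaultdict


def cluster_tags(vectors, n_components=5, min_cluster_size=20):
    """Reduce tag vectors with UMAP, then cluster them with HDBSCAN.

    `vectors` is a 2D array with one row per tag. Returns one label per tag;
    HDBSCAN marks tags that belong to no cluster with the label -1.
    """
    import umap     # third-party package: pip install umap-learn
    import hdbscan  # third-party package: pip install hdbscan

    reduced = umap.UMAP(n_components=n_components, metric="cosine").fit_transform(vectors)
    return hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)


def group_by_cluster(tags, labels):
    """Collect tags per cluster label, skipping unclustered tags (label -1)."""
    clusters = defaultdict(list)
    for tag, label in zip(tags, labels):
        if label != -1:
            clusters[int(label)].append(tag)
    return dict(clusters)


# Example with hand-written labels instead of a real clustering run.
topics = group_by_cluster(["bmw", "audi", "sushi", "xyz"], [0, 0, 1, -1])
```

Tags labeled -1 correspond to the 20%-30% of tags that end up outside any cluster.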