Geolocation of Instagram bloggers

implemented for Deep.Social Inc

For advertisers, one of the most important parameters of a blogger and their audience is the geographical location corresponding to the place of residence. To make advertising campaigns effective, they usually need to be targeted to a specific country or even city.

Geotags

Instagram posts can contain information about the location where the post was made, called geotags. This is the simplest and most obvious way to determine geolocation. The page header shows a map colored according to the number of posts made in each location worldwide (built using geotags from over 4 billion posts). The highest intensity of posts corresponds to places with a high population density: large cities with their suburbs and areas along major highways.

But geotags, despite their obvious usefulness, are not a reliable source of information about the account owner’s place of residence:

  • People often use geotagging only when they are away from home (on business trips, vacations, or sightseeing trips) or in places they find interesting. This means that a blogger living in London may have geotags from anywhere in the world except London.
  • Not all accounts include geotagging for posts. For the majority of posts (over 85%), geotags are not provided.

Therefore, if geolocation is determined only by geotags, the results will have low coverage and accuracy. To improve the quality of determination, additional information must be used.

Additional sources of information

There are many additional sources of information about the blogger’s location. They are less obvious than geotags, and individually less reliable.

  • First and foremost is the text in the account’s ‘bio’ section: it may contain the blogger’s city or local geographic names, websites or emails in national domains, emojis corresponding to country flags, etc.
  • The same information, plus textual tags, may be contained in the text of posts and comments.
  • In addition, information about language and audience geolocation can be used – it is obvious that, for example, a Japanese audience is more likely to follow a blogger from Japan than from Mexico. It is also unlikely that a Mexican blogger will post in Japanese.
  • Additionally, approximate geoinformation can be extracted from photos in posts (when there are no geotags).

All this information is extracted using a set of rules, including both complex hand-coded heuristics and machine learning models for named entity recognition. A discussion of extraction methods could be a project in itself, so here I will focus only on the use of already extracted geoinformation.
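As one small illustration of such a hand-coded rule (not the project's actual extractor), country flags in a bio can be decoded directly: a flag emoji is a pair of Unicode regional indicator symbols that spell a two-letter country code. The function name `flag_country_codes` is hypothetical, and the sketch does not check that the decoded pair is a real ISO 3166-1 code.

```python
import re

# A flag emoji is a pair of "regional indicator" code points (U+1F1E6..U+1F1FF),
# one per letter of the two-letter country code.
FLAG_RE = re.compile("[\U0001F1E6-\U0001F1FF]{2}")
REGIONAL_A = 0x1F1E6  # regional indicator symbol for the letter "A"


def flag_country_codes(text: str) -> list[str]:
    """Return the two-letter codes spelled out by flag emoji found in `text`."""
    codes = []
    for match in FLAG_RE.finditer(text):
        letters = (chr(ord(ch) - REGIONAL_A + ord("A")) for ch in match.group())
        codes.append("".join(letters))
    return codes


# Hypothetical bio containing the flag of Japan, written with escapes for portability.
bio = "Tokyo \U0001F1EF\U0001F1F5 | travel and food"
print(flag_country_codes(bio))
```

A real extractor would also validate the decoded code against a country list and combine it with other signals such as national domains in links and emails.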

Ensemble learning

So, there are several sources of geolocation information. Each source on its own is not very accurate or reliable, but using them together can give a much more precise result. The classical approach used for dealing with such sources is ensemble learning.

Ensemble learning is a machine learning paradigm where multiple learners are trained to solve the same problem. In contrast to ordinary machine learning approaches which try to learn one hypothesis from training data, ensemble methods try to construct a set of hypotheses and combine them to use.¹

In our particular case, a type of ensemble learning called stacking is used. The essence of this method is that there are two levels of models:

  • First level - these are the models that extract geolocation information from the Instagram data we talked about in the previous section (base classifiers).
  • Second level (meta-classifier) uses the predictions generated by the models at the first level to make the final decision.

Weighted majority voting

The simplest way to create a meta-classifier is to use weighted majority voting. There are $N$ base classifiers, and each is assigned a weight $w_i$ derived from its estimated probability of correct classification $\hat{p}_i$ (in our case, of correctly identifying the blogger’s country). The probabilities are estimated from the training data, and the weight is the log-odds of a correct answer: $$w_i=\log\left(\frac{\hat{p}_i}{1-\hat{p}_i}\right),\quad i=1,\dotsc,N$$

Classifiers that often make mistakes are assigned a lower weight, and classifiers with high accuracy are assigned a higher weight. This completes the “training”.

Usage: for the account of interest, each base classifier outputs a class label (in our case, the presumed country), resulting in a set of labels $l_1,\dotsc,l_N$. Each label takes a value from the set of all $M$ countries, $(c_1,\dotsc,c_M)$. The rating of each class is calculated by summing the weights of all votes cast for it.

$$R(k)=\sum_{l_i=c_k}w_i,\quad k=1,\dotsc,M$$

The resulting class (country) is the one with the highest rating, i.e., the one whose votes carry the largest total weight:

$$k^*=\operatorname*{arg\,max}_{k\in\{1,\dotsc,M\}} R(k)$$
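The voting scheme can be sketched in a few lines of Python. This is a minimal illustration of the formulas, not the production code; the classifier names and accuracy values are made up for the example.

```python
import math
from collections import defaultdict


def log_odds_weight(p_hat: float) -> float:
    """w_i = log(p_i / (1 - p_i)); p_hat must lie strictly between 0 and 1."""
    return math.log(p_hat / (1.0 - p_hat))


def weighted_vote(labels: dict[str, str], accuracy: dict[str, float]) -> tuple[str, dict[str, float]]:
    """Sum the weights of the classifiers voting for each country and pick the top one."""
    rating: dict[str, float] = defaultdict(float)
    for classifier, country in labels.items():
        rating[country] += log_odds_weight(accuracy[classifier])
    best = max(rating, key=rating.get)
    return best, dict(rating)


# Hypothetical accuracies of three base classifiers, estimated on training data.
accuracy = {"bio": 0.90, "geotags": 0.70, "audience": 0.80}
# Labels the base classifiers produced for one account.
labels = {"bio": "JP", "geotags": "BR", "audience": "JP"}

winner, rating = weighted_vote(labels, accuracy)
print(winner, rating)
```

Note that a classifier with accuracy below 0.5 gets a negative weight, so its vote counts against the country it names.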

The first model for geolocation determination used this method as the simplest to implement. But it has drawbacks:

  • The predictions of base classifiers can be incorrect. For example, the country Japan can be extracted from the bio, and Brazil and Germany from the geotags, while the blogger actually lives in England. Therefore, the meta-classifier should not only predict the country but also provide the probability of the correct answer. If the probability obtained is below some threshold, it should be considered that we do not have enough information to determine the country. But majority voting does not provide the probability value; we can only estimate the upper and lower bounds.
  • Majority voting cannot handle additional information that is not a class label, such as the language of the account.
  • Majority voting assumes that the predictions of base classifiers are uncorrelated with each other and that the probability of correct classification is the same for any country. In real life, neither of these assumptions holds: we need to take into account the possible interactions between classifiers and individual features of countries.

Gradient Boosting

Considering all the drawbacks of majority voting, a more sophisticated model was developed using gradient boosting. In this project, we used the CatBoost library from Yandex.

Gradient boosting is one of the best-known machine learning algorithms for tabular data, and the quality of the resulting model exceeded all expectations.

Training Data Formation

To train the model, we first needed to obtain labeling for Instagram accounts in the form of the owner’s true place of residence. Obtaining this data directly from the owners was difficult, so an indirect source was used: Twitter.

A Twitter user can specify their place of residence (location), and this information is publicly available. If the same person has an Instagram account, then we can use the geolocation information from Twitter as ground truth. Of course, geodata from Twitter is not always completely accurate, as people do not always provide their real place of residence or may move and forget to update their location. According to rough estimates, approximately 1-3% of Twitter accounts have incorrect or outdated information about their location. However, as practice has shown, this did not impede training.

The location from Twitter is a string in arbitrary format. Sometimes it specifies a country, sometimes a city, sometimes a state, and sometimes just an aphorism or a joke that has nothing to do with geography. To train the model, this data must be converted into a country code. The task of obtaining geographic coordinates (and, from them, the country) from an address in arbitrary format is called geocoding. In this project, the Nominatim system created by the OpenStreetMap community was used for geocoding.
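A minimal sketch of such a lookup against the public Nominatim search endpoint is shown below. It assumes the public server at nominatim.openstreetmap.org and its `addressdetails` option; the project may well have used a self-hosted instance and different parameters. The function names and the `geo-demo/0.1` user agent are placeholders.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(location: str) -> str:
    """URL for a free-form query that also returns the address breakdown."""
    params = {"q": location, "format": "jsonv2", "addressdetails": 1, "limit": 1}
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


def country_code_from_response(results: list) -> Optional[str]:
    """Upper-cased country code of the top result, or None if nothing matched."""
    if not results:
        return None
    code = results[0].get("address", {}).get("country_code")
    return code.upper() if code else None


def geocode_country(location: str, user_agent: str = "geo-demo/0.1") -> Optional[str]:
    """Network call; the public server expects an identifying User-Agent and a low request rate."""
    request = Request(build_search_url(location), headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return country_code_from_response(json.load(response))
```

Strings that are jokes or aphorisms typically return an empty result list, which the sketch maps to `None`.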

Geocoding often does not have a unique solution. For example, the string “Moscow” can refer to the capital of Russia or a city in the USA. “Georgia” can be either the country of Georgia or the US state of Georgia.

To eliminate ambiguities and improve accuracy, additional analysis of Google search results was used. For each address string, the search results were checked for the presence of a “geographic location” snippet, and if found, the link to information on WikiData was analyzed. If the country from WikiData matched the country identified by Nominatim, the geocoding result was accepted; otherwise, additional manual verification was needed. The idea here is that Google, based on its own analysis of search query popularity, knows which city people usually mean when they write “Moscow.” As the results showed, such verification was useful and significantly increased the accuracy of geocoding compared to using pure Nominatim.
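The acceptance rule described above can be written as a small decision function. This is a sketch under one assumption not stated in the text: when no “geographic location” snippet is found (no WikiData country), the string is also sent to manual review. The function name and status strings are hypothetical.

```python
from typing import Optional


def resolve_country(nominatim_code: Optional[str],
                    wikidata_code: Optional[str]) -> tuple[Optional[str], str]:
    """Accept the geocoding result only when both sources name the same country."""
    if nominatim_code and wikidata_code and nominatim_code == wikidata_code:
        return nominatim_code, "accepted"
    # Disagreement, or a missing answer from either source (assumption).
    return None, "manual_review"


print(resolve_country("RU", "RU"))
print(resolve_country("US", "GE"))
```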

The Model

The most straightforward approach to modeling is multiclass classification, where each class corresponds to a separate country. However, this type of model does not perform well in practice. The problem is that the distribution of countries based on the number of accounts in them is highly uneven. There are a few dominant countries with the majority of accounts and a long tail of smaller countries or countries where Instagram is unpopular. As a result, the country identification quality for countries in the “tail” is mediocre. The models simply lack enough data for training in these countries.

Fig. 1. The dependency of accuracy (Y-axis) on the number of country accounts in the training sample (X-axis, logarithmic scale) for one of the early models.

As shown in Fig. 1, the accuracy decreases with a smaller number of accounts per country. To avoid this effect, a different method of modeling classes was chosen, conceptually similar to majority voting, where the model does not distinguish countries from each other but considers only the “strength” of votes for each country.

In the data for each account, base classifiers typically recognize no more than 3-4 different countries. From these countries, the correct answer (or the absence of an answer) must be selected. Thus, the model needs only four country classes, each corresponding to one unique country in the input data, plus one negative class called “Other”, which lets the model signal that none of the proposed countries is suitable. The candidate countries are ranked by popularity: the most frequently encountered country becomes class #1, the second most frequent becomes class #2, and so on. At the same time, to account for the individual characteristics of specific countries, the proposed country codes are also fed into the model.
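The class encoding can be sketched as follows. This is one possible reading of the description, not the project's code: it assumes “popularity” means the number of base classifiers proposing a country for the given account, and it breaks ties alphabetically. All names (`rank_candidates`, `target_class`, `slot_features`) are hypothetical.

```python
from collections import Counter

MAX_SLOTS = 4  # number of country classes
OTHER = 0      # negative class: none of the proposed countries is correct


def rank_candidates(votes: dict[str, str]) -> list[str]:
    """Order the countries proposed for one account by how many sources name them."""
    counts = Counter(votes.values())
    # Tie-breaking by country code is an assumption made for determinism.
    ranked = sorted(counts, key=lambda country: (-counts[country], country))
    return ranked[:MAX_SLOTS]


def target_class(ranked: list[str], true_country: str) -> int:
    """Class k means "the k-th ranked candidate is correct"; OTHER means none is."""
    return ranked.index(true_country) + 1 if true_country in ranked else OTHER


def slot_features(ranked: list[str]) -> dict[str, str]:
    """Country codes per slot, passed to the model as categorical features."""
    padded = ranked + [""] * (MAX_SLOTS - len(ranked))
    return {f"slot{i + 1}_country": code for i, code in enumerate(padded)}


votes = {"bio": "JP", "geotags": "BR", "audience": "JP", "language": "JP", "emoji": "DE"}
ranked = rank_candidates(votes)
print(ranked, target_class(ranked, "JP"), target_class(ranked, "GB"))
```

Because the target is a slot number rather than a country, a training example from a rare country still teaches the model how to weigh the evidence, which is the property the model shares with majority voting.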

As a result, this model combines the best aspects of majority voting (the ability to learn from any country, even if it appears only once in the training set) and of multiclass classification with per-country classes (the ability to consider the specific characteristics of individual countries). Experiments showed that accuracy increased significantly for smaller countries without compromising accuracy for larger ones.

Adjusting the Weights of Training Examples

Fig. 2. The distribution of the top 18 countries’ shares on Twitter (from training data) and Instagram (from geotags)

The distribution of countries on Twitter differs significantly from that on Instagram, as seen in Fig. 2. For example, Instagram’s popularity in Russia and Iran far exceeds that of Twitter. If the training data is used as is, the model will memorize the country distribution from Twitter and adjust its answers accordingly. The probability that the model will output Iran as an answer will be as low as Iran’s share on Twitter.

To avoid this, each training example was assigned an additional weight during training, equal to the ratio of its country’s share on Instagram to that country’s share on Twitter.
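A minimal sketch of this reweighting is shown below. The shares are invented for illustration and are not the project's data; the default weight of 1.0 for countries missing from either table is also an assumption.

```python
def example_weights(countries: list[str],
                    instagram_share: dict[str, float],
                    twitter_share: dict[str, float]) -> list[float]:
    """Weight of each training example = Instagram share / Twitter share of its country."""
    weights = []
    for country in countries:
        ig = instagram_share.get(country)
        tw = twitter_share.get(country)
        # Fall back to a neutral weight when a share is unknown or zero (assumption).
        weights.append(ig / tw if ig and tw else 1.0)
    return weights


# Hypothetical shares: country "IR" is four times more common on Instagram.
instagram_share = {"US": 0.20, "IR": 0.04}
twitter_share = {"US": 0.40, "IR": 0.01}
train_countries = ["US", "IR", "US", "XX"]

weights = example_weights(train_countries, instagram_share, twitter_share)
print(weights)
```

Such a list can then be passed as per-example weights to the training routine, for example through the `sample_weight` argument of CatBoost's `fit`.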

Training Results

An accuracy of 97% was achieved in identifying the country with a coverage of 86% (i.e., for 14% of the accounts, the available information was insufficient for reliable decision-making, and the issued probability was below the specified threshold). If no threshold probability is used and predictions are considered for any accounts with at least some geoinformation, an accuracy of 93.4% is achieved. The accuracy figures were measured on a test set, i.e., on accounts that the model had never seen during training and hyperparameter tuning.
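The trade-off between the two figures can be reproduced with a short helper that applies a probability threshold to test-set predictions. The records below are invented, and the sketch assumes coverage is measured over all records passed in.

```python
import math


def accuracy_and_coverage(records: list[tuple[str, str, float]],
                          threshold: float) -> tuple[float, float]:
    """records are (predicted, truth, probability) triples.

    Coverage is the fraction of accounts whose probability reaches the threshold;
    accuracy is measured only on those accounts.
    """
    kept = [(pred, truth) for pred, truth, prob in records if prob >= threshold]
    coverage = len(kept) / len(records) if records else 0.0
    accuracy = sum(pred == truth for pred, truth in kept) / len(kept) if kept else math.nan
    return accuracy, coverage


# Hypothetical test-set predictions.
records = [
    ("US", "US", 0.99),
    ("BR", "BR", 0.95),
    ("JP", "GB", 0.55),
    ("DE", "DE", 0.80),
    ("IR", "IR", 0.40),
]

for threshold in (0.0, 0.5, 0.9):
    print(threshold, accuracy_and_coverage(records, threshold))
```

Raising the threshold discards low-confidence answers, so accuracy tends to go up while coverage goes down.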

Fig. 3. The interdependence between accuracy and coverage at different probability threshold values.

Fig. 4. The importance of input features for the model.

As seen in Fig. 4, the most important features are country_id (country identifier, allowing the model to consider the individual characteristics of specific countries), geotags from posts, blogger language, and audience geodata. However, this diagram does not accurately reflect the real importance since not all features are present in every account.

Fig. 5. The importance of input features for the model. Features are normalized by frequency.

When normalizing the importance of features by their frequency in accounts, the leading features are those with the most accurate geoinformation: phone numbers (country code), geotags, and emoji (country flags).

We can delve even deeper and analyze the model using the SHAP framework.²

Fig. 6. Feature importance using SHAP.

The further the points in Fig. 6 deviate from the center, the more significant the impact of the given feature. Blue points represent zero values, usually corresponding to the absence of the feature in the input data, while red points are non-zero, i.e., in this context, non-empty.

Fig. 7. Feature importance detail (SHAP) for individual accounts.

As shown in Fig. 7, the most “active” features are country_id, geotags from posts, language, and emoji.

Model Performance Example

To showcase the real-world results of the model, a demo sample of 2,000 accounts (based on test data that the model had never seen during training) has been created. The sample is a CSV file consisting of 5 columns:

  • instagram – Instagram account for which the country was determined
  • twitter – corresponding Twitter account, the source of “truth”
  • prediction – predicted country
  • truth – country from Twitter
  • probability – probability, interpreted as confidence in the prediction

The file can be downloaded here.

The sample includes results with a probability > 90%. As seen from the provided data, the quality of linking Instagram <–> Twitter is not perfect, and errors that degrade the measured resulting accuracy occur. Therefore, the actual model accuracy might be even higher than the measured 97%.

Geodata Visualizations

Finally, let’s take a look at a couple of additional visualizations of Instagram’s geographic data, obtained using the model described above.

Visualization 1

Let’s calculate the distribution of the number of Instagram accounts by country. Of course, it is impossible to determine the absolute number, as that would require data for all existing Instagram accounts. Instead, we can take a large enough sample and determine the percentage of all accounts belonging to each country.

The top three countries in terms of the number of accounts are USA, Brazil, Indonesia. The fewest accounts are in African countries, North Korea, and Greenland.

Visualization 2

The previous visualization is interesting, but it is obvious that the more populous a country is, the more Instagram accounts it will have. To eliminate dependence on population, we can calculate the percentage of the total number of Instagram accounts per capita for each country. This way, we obtain the popularity of Instagram among the population.
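The per-capita calculation can be sketched as follows. The country labels and numbers are placeholders, not results from the project, and rescaling so that the top country equals 1 is an assumption made to get a convenient color scale.

```python
def per_capita_index(account_share: dict[str, float],
                     population: dict[str, int]) -> dict[str, float]:
    """Share of sampled accounts divided by population, rescaled so the maximum is 1."""
    raw = {
        country: share / population[country]
        for country, share in account_share.items()
        if population.get(country)  # skip countries with unknown or zero population
    }
    top = max(raw.values())
    return {country: value / top for country, value in raw.items()}


# Placeholder data: "A" is large with many accounts, "B" is small but Instagram-heavy.
account_share = {"A": 0.30, "B": 0.01}
population = {"A": 300_000_000, "B": 1_000_000}

index = per_capita_index(account_share, population)
print(index)
```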

A completely different picture emerges. The top five countries where residents love Instagram the most are Cyprus, United Arab Emirates, Iceland, Qatar, and Malaysia. These countries are colored light yellow, making them less visible on the map. Among the larger countries, Instagram is most popular in Brazil, Australia, and the USA. Instagram is least popular in Africa and Asia.

Conclusion

The project has been deployed into production and is successfully used by the client, significantly improving the accuracy of determining the residence of bloggers and their audiences, and reducing the number of customer complaints about inaccurate data. Currently, the project is also being used by companies that have purchased the rights to this product.

Possible improvements:

  1. Currently, the geotags being considered are simply summed, and their order is not taken into account (bag of geotags). However, there is often a clear temporal structure in geotags that corresponds to the geographic movements of the account owner. If this structure is taken into account, i.e., working with geotags as a time series, it may be possible to increase accuracy. For instance, the model could distinguish between situations when an account owner goes on a business trip and when they move to a new place of residence.
  2. Information from photographs is not currently used, as training a computer vision classifier takes a long time, and the project would not have fit within the planned timeframe. If this information were to be utilized, it would provide an additional data source, particularly relevant for accounts that do not use geotags and do not provide additional information in their bio. This would result in increased coverage.

  1. Zhi-Hua Zhou, Ensemble Learning

  2. Scott Lundberg, Su-In Lee (2017). A Unified Approach to Interpreting Model Predictions. arXiv:1705.07874 [cs.AI]