Building Your Own User Database for Fun and Profit (and Re-targeting)

Facebook does it, Google does it, Twitter does it. There are many companies that create databases and lists for you to be able to re-engage and re-target your mobile app users. But why not do it yourself and be independent of any 3rd party to tell you who your users are. Today, as a first article of a series, we want to show you the theoretical basics of user databases and why you should run them yourself.

Who should read this?

You have an app and you are running user acquisition campaigns aka. app marketing to get new users into your app? You may even already use advanced tracking systems like to identify where your users are coming from? Great, but what to do about those users once they start or stop using your app?

In order to maximize your LTV you need to re-engage them, either via targeted ads on e.g. Facebook or via push notifications custom tailored to them. If this is you, read on.

Why do it yourself?

As initially stated, there are many companies that can use tracking data from your app to generate user lists to know which users to re-engage. The big inventory providers like Facebook, Google or Twitter have this already integrated into their platforms. Companies like Applovin, Appboy or Tapcommerce offer it as a service often combining it with push capabilities to leverage multiple re-engagement channels.

The main problem with all of them is that you are dependent on a 3rd party to store your user data, that, at least to a certain degree, will sell exactly this data back to you. In order to get started with a re-targeting campaign you typically need to populate the provider database by forwarding your tracking data for a couple of weeks. And when you feel like changing the provider the whole game starts again.

So why not take this into your own hands and take your app marketing to the next level?

Basic problems to solve

There are 3 basic problems that need to be solved before we can run our own re-engagement campaigns by providing IDFA or Google Advertiser ID lists to the inventory provider of our choice or sending custom push messages to the right users.

1. Importing data

First, we need to get all the installs, sessions, events plus any segmentation data from your app. If you are using an app tracking provider you usually have the option to receive a daily export of all your raw data.

More advanced providers like allow you to set up real-time callbacks to stream your data ad-hoc into your data warehouse. In this case you need an HTTP endpoint to receive those callbacks and save them to your database, which leads us to the next point.

2. Storing user profiles

The most crucial point in maintaining your own user database is the way you will structure the stored data. The goal of a user database is the ability to query quickly for users that match certain criteria. This means we need to optimize our data structure in order to enable these kind of queries. Interestingly enough, it matters less what database technology you use but how you store the data.

For the sake of argument, we will use a row-based database as example, column stores and document-based databases with map-reduce capabilities will work quite similar in our use cases.

Let’s look at the way tracking data from apps is typically stored:

Event-based databases

| created_at | event_name | device_id | segmentation_data |
|            |            |           |                   |

Each row is representing an event (e.g. install, session, purchase). Those events can also be split up into separate tables for each event type, but the basic principle is the same.

Segmentation data is anything that further qualifies a user, for example:

| ... | device_name | device_type | language | country | os_name | os_version | app_version | timezone |
|     |             |             |          |         |         |            |             |          |

Additionally you most likely want to store the attribution data for your users as well:

| ... | network_name | campaign_name | adgroup_name | creative_name |
|     |              |               |              |               |

This kind of schema has a couple of problems:

  • one row per event leads to huge databases, typically making sharded/distributed databases necessary because each attribute being stored over and over again for all the events of a user takes up a lot of space
  • trying to look up single users can lead to having to scan the full dataset to collect all his events

Querying a database like that for all users that have spent more than $10.00 and have been inactive for 3 days would result in complex joins or multiple expensive map-reduce steps.

In our use case this approach would pretty much generate a giant database that’s almost impossible to query. So what’s the plan B?

User-based databases

The alternative to storing each event individually is to store one record/row per user. This moves the database load from inserts to updates and you may want to consider this picking your technology.

The underlying concept for this kind of database is to reduce/aggregate the individual event data into columns of our user database by triggering updates on a given user row. This requires some planning ahead on what questions we want to be able to ask. A typical case is a column for the sum of e.g. revenue, session count or time spent. Another common pattern is to have a first and last occurrence time stamp column for events in your app along with fields for installed_at.

This is how a record may look like:

| installed_at | device_id(s) | total_revenue | first_event_time | last_event_time | total_event_count | attribution_data | segmentation_data |
|              |              |               |                  |                 |                   |                  |                   |

To be able to send push messages to our users and to mitigate advertiser id changes we can use following fields for our device ids.

| ... | idfa | idfv | android_id | google_advertiser_id | mac_address | push_token | your_indivdual_user_id |
|     |      |      |            |                      |             |            |                        |

Of course this schema doesn’t only have upsides. Following issues are to be considered:

  • high update frequency may not be feasible for all databases (table bloat)
  • certain analytics information is lost (DAU, sessions per day)
  • importing data from dumps or callbacks requires more complex logic than simple inserts

However those issues can be solved and as long as we are not trying to use this database for general analytics workload we’ll be fine.

The two most important benefits for this kind of structure far outweigh it’s drawbacks:

  • reduces number of records by 10-100X (based on actual data)
  • enables simple conditional queries for users

With this database approach we can tackle the last of our problems:

3. Retrieving user lists

Given that we are using the tables mentioned before we can run very helpful queries to identify users in our app that we want to use in e.g. a custom audience on Facebook and make special offer to.

Imagine you want to get all your users that have been inactive for 2 weeks:

SELECT idfa FROM users where last_session_time < '2014-10-20';

Easy…Let’s look for high rollers that haven’t updated to your latest version (2.0) and have been inactive for 7 days and send them a push notification.

SELECT push_token FROM users where last_session_time < '2014-10-27' and app_version < '2.0' and total_revenue > 50.0;

Or learn which kind of devices are used to spent most in your app:

SELECT device_name, AVG(revenue) FROM users GROUP BY device_name;

So that gives us an excellent basic framework for retrieving user segments and behavioural data on-the-fly. This is often enough. It is relatively easily extendable, to make it even quicker and easier to work with for marketing folks. From here, if you wanted to help your marketing folks work with it more intuitively, a simple API and a connected UI could let them retrieve their lists and segments right from the browser.


The scope of this article was to discuss the theoretical base for a system to store and retrieve device id lists with the purpose of re-engaging users of a mobile app.

We examined the available storage schemata and found that a user based database would be best suited for this task.

Over the course of the next weeks we will build an open source prototype of this concept and publish a series of articles about it here for you to participate. The goal will be to establish a standard that allows partners to get started faster with re-targeting campaigns by using a publishers internal database (via an HTTP API) and publishers to take ownership of their data. Till then,

Have fun.