Don't Drown While Filling Your Data Lake

Publishers are starting to ramp up their investment in data, especially as they gain more prowess in programmatic advertising. Many are throwing around the idea of creating their own data warehouse.

Data is currency. Amazon uses data to give you a more relevant retail experience. Google uses data to organize all of the information on the Internet. Facebook uses data to map your social behavior (and sometimes winds up in trouble for it).

Publishers need to determine what they want to use data for, and whether they have the resources to collect, review and manage that data well enough for an investment in a warehouse to be worthwhile. Otherwise, a data warehouse ends up more like a data lake, with quicksand at the bottom.



Focus on the Goal

Data can be currency, but not every data point is valuable enough to keep. Thinking that you should simply collect and park every data point that comes into your organization is a bad idea. Storing it will be expensive. You will have trouble squeezing insights from a huge data set. And you run a bigger security risk. It’s highly unlikely that this approach will bring you enough long-term reward to overcome the early problems.

Instead, start with a clear goal, and focus on the data points that help you reach your goal. You might have a goal to normalize pricing across your inventory based on advertiser bidding patterns. In that case, you would focus on collecting real-time bid stream data from your demand partners. Or, you might be interested in understanding the market value of various pieces of advertising inventory on your website cross-analyzed with different audience groups. So you’d need a solution that merges data from your DMP and your ad servers. Every business problem is different, and so every data warehouse should look different.

Keep Your Head Out of the Clouds, Even if Your Data Is In the Cloud

Think of a data warehouse like a real brick-and-mortar building that will store your stuff. You need a clean, safe, secure storage facility. You need to be able to grant access to certain people and restrict others. Trucks need to be able to pull up and drop data off at regular times, and you need an organized place to put incoming data that works with what’s already there.

For publishers with limited resources, these responsibilities might stretch beyond the reasonable limits of their organization. Don’t let developer hubris get in the way of a prudent decision. You probably do not need your own servers or your own room in a custom data center. Amazon, Microsoft Azure or Google Cloud will likely end up being the best partner because each offers a relatively full-service option, and that’s OK. The most important points to cover are that your data is secure, organized and accessible, and that your setup can accommodate the influx of new data without becoming unmanageable.



  • Log-level data

  • Try to get a sample first, if you can, to understand how the data would answer the questions below. (This is daunting if you run multiple publications, and still painful if you run just one, because the data is based on custom content key-value pairs or audience key-values.)

  • Be aware that log-level data is far larger in volume than the rolled-up data most digital publishers and marketers collect and analyze on their own.

  • You need a roll-up strategy to handle it. Decide in advance what you plan to look for in this data and what you want to get out of it. For example:

    • Do you need to see bids by page or content?

    • Do you need to see clearing price by advertiser or exchange?

    • Do you need to see fluctuations over time (holidays, day of week)?


Only when you roll it up can you actually use it. If you can’t answer questions with it, hold off on collecting this data until you have figured out at least your top three questions.
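To make the roll-up idea concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the sample rows, the field names (ts, page, advertiser, bid_cpm, clearing_cpm) and the roll_up helper are hypothetical, and real log-level files from your partners will look different. The point is only that each question you want answered becomes one grouping key.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log-level rows; the field names are assumptions for illustration.
LOG_ROWS = [
    {"ts": "2023-01-02T09:15:00", "page": "/health/flu", "advertiser": "A",
     "bid_cpm": 2.00, "clearing_cpm": 1.50},
    {"ts": "2023-01-02T10:40:00", "page": "/health/flu", "advertiser": "B",
     "bid_cpm": 3.00, "clearing_cpm": 2.50},
    {"ts": "2023-01-03T14:05:00", "page": "/sports/scores", "advertiser": "A",
     "bid_cpm": 1.00, "clearing_cpm": 0.50},
    {"ts": "2023-01-07T20:30:00", "page": "/health/flu", "advertiser": "A",
     "bid_cpm": 4.00, "clearing_cpm": 3.50},
]

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def roll_up(rows, key_fn):
    """Collapse log-level rows into one summary per key returned by key_fn."""
    totals = defaultdict(lambda: {"bids": 0, "bid_sum": 0.0, "clearing_sum": 0.0})
    for row in rows:
        bucket = totals[key_fn(row)]
        bucket["bids"] += 1
        bucket["bid_sum"] += row["bid_cpm"]
        bucket["clearing_sum"] += row["clearing_cpm"]
    return {
        key: {
            "bids": b["bids"],
            "avg_bid_cpm": round(b["bid_sum"] / b["bids"], 2),
            "avg_clearing_cpm": round(b["clearing_sum"] / b["bids"], 2),
        }
        for key, b in totals.items()
    }


# One grouping key per business question.
def by_page(row):
    return row["page"]


def by_advertiser(row):
    return row["advertiser"]


def by_day_of_week(row):
    return DAY_NAMES[datetime.fromisoformat(row["ts"]).weekday()]


page_summary = roll_up(LOG_ROWS, by_page)
advertiser_summary = roll_up(LOG_ROWS, by_advertiser)
day_summary = roll_up(LOG_ROWS, by_day_of_week)
```

If you cannot write down the grouping keys you care about, as by_page and by_advertiser do here, that is a sign you have not yet settled on your top three questions.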

  • And can you act on the data once you have it?

    • Do you raise floors during winter if your content gets higher bids in sick season?

    • Do you change guaranteed direct advertiser CPMs before that season?

The Opposite of Set It and Forget It

Speaking of unmanageable, data has a habit of spinning out of control, and you’ll need a lot more than an organized warehouse to keep it in shape. Take the programmatic example: every day, a typical publisher pulls data in from 10, 20 or 30 different sources, and every day there are errors in that data. You’ll need the resources to address errors within millions or billions of data points before you simply back up the truck and dump the info into your warehouse. On top of that, APIs stop working, field names change and partners change their policies. You need to be on top of every minute change or you fall victim to the “garbage in, garbage out” problem. At that point, your entire warehouse is compromised.
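One way to picture the checking step is a small gate that inspects each incoming row before it is loaded. The Python sketch below is only an illustration under stated assumptions: the expected fields (date, partner, impressions, revenue), the partner names and the validate_row and split_batch helpers are hypothetical, and a real pipeline would need rules specific to each data source.

```python
# Hypothetical schema for one partner's daily report; an assumption for illustration.
EXPECTED_FIELDS = {"date": str, "partner": str, "impressions": int, "revenue": float}


def validate_row(row):
    """Return a list of problems found in one row; an empty list means it can load."""
    problems = []
    missing = EXPECTED_FIELDS.keys() - row.keys()
    unexpected = row.keys() - EXPECTED_FIELDS.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if unexpected:
        # A renamed field shows up as one missing name plus one unexpected name.
        problems.append(f"unexpected fields: {sorted(unexpected)}")
    for name, expected_type in EXPECTED_FIELDS.items():
        if name in row and not isinstance(row[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    if isinstance(row.get("impressions"), int) and row["impressions"] < 0:
        problems.append("impressions is negative")
    return problems


def split_batch(rows):
    """Separate rows that are safe to load from rows that need a human to look."""
    clean, quarantined = [], []
    for row in rows:
        problems = validate_row(row)
        if problems:
            quarantined.append((row, problems))
        else:
            clean.append(row)
    return clean, quarantined


# Hypothetical batch: one good row, one with a renamed field, one with a bad value.
batch = [
    {"date": "2023-01-02", "partner": "ExchangeA", "impressions": 1000, "revenue": 12.5},
    {"date": "2023-01-02", "partner": "ExchangeB", "imps": 500, "revenue": 4.0},
    {"date": "2023-01-02", "partner": "ExchangeC", "impressions": -5, "revenue": 0.1},
]

clean, quarantined = split_batch(batch)
```

Quarantining bad rows instead of dropping them silently is a deliberate choice in this sketch: someone still has to review what was held back, which is the staffing cost the warehouse analogy is about.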

This is where the warehouse analogy really matters. There is no brick-and-mortar warehouse that sits unattended where trucks simply back up and dump merchandise. There are people managing which door the trucks come to, people driving the forklifts, people checking, recording and cataloguing each delivery, and janitors keeping it clean. These labor costs are well understood in the world of physical storage, but are often dangerously neglected in the world of data.

I know of one publisher that put all warehouse management responsibility on a single person. When that person left the company, their data warehouse did sit unattended as tons of data piled up. Their storage costs and risk piled up, too. It was several months before the finance department noticed the mounting costs and figured out where they were coming from.

The moral of the story is that collecting and storing data is complicated. It requires a plan and goals, management and oversight. Otherwise, all your valuable insights will be sucked into the quicksand at the bottom of the lake.