
Understanding Catalog from a Developer’s Perspective


Building for Real-World Application


As a developer, sometimes it’s hard for me to see past the module I’m currently building and look at the broader context. About a month ago, I hit that point in developing Catalog (our new metadata cataloging tool that launched today).

I had called over my technical lead, Nik, to show him — proudly, like a boy who had caught a toad — some nuance in how I had configured the table to work. Feeling very satisfied in having overcome a technical hurdle, I was disappointed when the unimpressed Nik asked a question I’ll never forget: “Yes, but what does the app do?” The bubble that had formed around me and the current module I was tinkering with — a bubble that had no room for the real-world application of Catalog — popped right then.

What does Catalog do? What should it do? These questions rattled around in my mind for days. I pulled all sorts of people into lofty conversations about the purpose of Catalog and asked data scientist friends about their daily workflow — all in an attempt to answer Nik’s question with more than just “look at how you can select things in the table!”

When I arrived at a satisfactory answer, I outlined it in an email to the Catalog team, excerpted here:

What Catalog does is simple, almost to the point of being obvious. The tooling it provides can be categorized into two main workflows: discovering available data sets and curating the catalog.

Enterprises often have immature methods of communication surrounding their data, especially among disparate parts of the organization. A scientist in Antwerp might not know about extremely relevant experimental results from the organization’s labs in Atlanta, and therefore the scientist’s own experiments suffer — or perhaps could be duplicative and unnecessary. Similarly, a marketing team in Tokyo might buy data about potential customers that their sibling team in San Francisco already has.

Simply put, there is data somewhere else in the organization that a perfectly timed and placed phone call could discover. But in practice, methods that depend on word of mouth or human memory fall further and further short as the organization grows.

This brings us to the first main workflow of Catalog: discovering available data sets.

Catalog essentially makes that exact phone call readily and easily available to the scientist in Antwerp and the marketing team in Tokyo. If the scientist could see just the names of the data sets in the organization, they could run a text search for “silkworm” and see what relevant information the company already has. If the data sets were tagged, the scientist could further filter on the tag “Experimental Results” and find the data for that relevant experiment run in Atlanta. If the data sets showed who the real-world owner of the source was, including their contact information, the scientist could be on the phone with the team lead of the Atlanta labs within minutes, saving time and resources for the Antwerp labs and the organization at large.
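To make that workflow concrete, here is a minimal sketch of the search-then-filter idea in Python. It is not Catalog's actual data model or API: the `DataSet` class, the `discover` function, and every sample entry (names, tags, owners, email addresses) are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    """One hypothetical catalog entry: a name, tags, and a real-world owner."""
    name: str
    tags: set[str] = field(default_factory=set)
    owner: str = ""
    owner_contact: str = ""


def discover(catalog, text="", tag=None):
    """Return entries whose name contains `text` (case-insensitive) and,
    when `tag` is given, that also carry that tag."""
    needle = text.lower()
    return [
        ds for ds in catalog
        if needle in ds.name.lower() and (tag is None or tag in ds.tags)
    ]


# Illustrative sample data only; none of these entries come from a real catalog.
catalog = [
    DataSet("Silkworm growth trial", {"Experimental Results"},
            "Atlanta labs team lead", "atlanta-lead@example.com"),
    DataSet("Silkworm supplier invoices", {"Finance"},
            "Procurement", "procurement@example.com"),
    DataSet("Tokyo customer prospects", {"Marketing"},
            "Tokyo marketing", "tokyo-marketing@example.com"),
]

# Text search for "silkworm", then narrow to the "Experimental Results" tag.
hits = discover(catalog, text="silkworm", tag="Experimental Results")
for ds in hits:
    print(ds.name, "->", ds.owner, ds.owner_contact)
```

The text search alone matches both silkworm entries; adding the tag filter leaves only the experiment, along with the owner the scientist would call.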

But this information isn’t magically available for exploration upon installation. The sets need to be uploaded and the tags and the source owner information need to be added. This is the second main workflow of Catalog: curating the catalog.

The team lead of the Atlanta labs needs to upload (at least a reference to) the silkworm experiment data to Catalog, needs to make sure it’s named appropriately, needs to tag it with the “Experimental Results” tag, and needs to add himself or herself as the Source Owner. At scale, with every experiment and any other relevant data set owned by the Atlanta labs, this could be a tedious undertaking.

Catalog must make both of these processes as functional, user-friendly, intuitive, and scalable as possible. If we make catalog curation and data set discovery easy to use and valuable to the scientists and team leads and marketing teams of the world, Catalog will have done its job.

A corollary to that conclusion is that the worth of every effort we make in developing Catalog can be measured by how directly the effort aids those two endeavors. This was the gap between development and real-world application that Nik’s question was meant to bridge. With this understanding I could look back at my newly configured table selection and appraise it for its actual value. Does being able to select items in the table help people curate their catalog, or discover data sets? The answer of course is still yes, but understanding the how is where the difference lies. Being able to connect something as low-level as table selection to solving the actual problem our customers have clads the entire development process in purposeful iron.

Postscript: A field engineer in our organization, Alex Klarfeld, happened to validate this conclusion for me. Alex started asking around for data that people at Tamr used for personal purposes: demoing, testing, and so forth. He had the idea to get it all in one central location so everyone in the company could improve their ability to test and demo via a better selection of data. In other words, as a ~60-person company we already had a need to improve how we communicated about what data we had. So naturally, Alex spun up a Catalog instance to fill that need.

Sam Roberts is a Software Engineer at Tamr.