In this post, we examine the ways that computers can be used to automatically tag image files. There are some important differences in the approaches you may use. We will start with a discussion of the two main Machine Learning methods.
Let’s examine the difference between services that create a set of static tags and those that operate as black box services. There are advantages and disadvantages of each.
As of this writing, there are dozens of computational tagging services that can automatically create tags for your images. These services can analyze an image, and return a list of likely tags that describe the visual content. This can include a description of objects, activities, people and other situational characteristics. Each of these tags is typically accompanied by a confidence score that indicates the certainty for any particular tag.
Computational tags can be written into your image database as static metadata, meaning that they won’t change unless someone changes them. You should be able to see these tags, map them to appropriate fields, and decide to accept or reject them, just like metadata that is added by a person.
Essentially, you “own” the tags.
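To make the accept-or-reject idea concrete, here is a minimal sketch of how a catalog might triage static tags by confidence score. The `MachineTag` class, the `triage` function, and both thresholds are assumptions made up for illustration. They are not taken from any particular DAM or tagging service.

```python
from dataclasses import dataclass


@dataclass
class MachineTag:
    label: str
    confidence: float        # 0.0 to 1.0, as reported by the tagging service
    status: str = "pending"  # "pending", "accepted", or "rejected"


def triage(tags, accept_at=0.90, reject_below=0.50):
    """Auto-accept confident tags, auto-reject weak ones, leave the rest for a person."""
    for tag in tags:
        if tag.confidence >= accept_at:
            tag.status = "accepted"
        elif tag.confidence < reject_below:
            tag.status = "rejected"
    return tags


# Illustrative tags and scores, not real service output.
tags = triage([
    MachineTag("portrait", 0.97),
    MachineTag("journalist", 0.72),
    MachineTag("umbrella", 0.12),
])
statuses = {t.label: t.status for t in tags}
print(statuses)
```

Anything left as "pending" goes to a human reviewer, which is the point of owning the tags: the final decision stays with you.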
STATIC TAGGING SERVICES
Many services are racing to become market leaders in static tagging. These include big players like Google Cloud Vision, Amazon Rekognition, and Microsoft Azure Cognitive Services. There are also a lot of startups, like Clarifai, going down the same path. As of this writing, most of these services operate in a similar manner.
Tagging by API
Static tags are usually provided by means of an Application Programming Interface (API). An API allows one service (e.g., a DAM application) to talk to another (e.g., a computational tagging service). The DAM can send photos for analysis, and the tagging service sends back a list of tags, usually in the form of a JSON file. The DAM application is then responsible for adding the tags to the database for each image.
The figure below shows what this JSON file looks like.
In most computational tagging services, a copy of an image is sent to the service through an API and the resulting tags are sent back as JSON. In this example, Microsoft Cognitive Services assigned the tag “people_portrait.” It also recognized the person in the photo as Gwen Ifill and drew a rectangle around her face. You can also see a very high confidence rating, greater than 99%.
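For readers who want to see the DAM side of this exchange, here is a rough sketch of turning a JSON response into rows that could be written to a database. The payload is a hypothetical one, loosely modeled on the example described above. The field names, the category score, and the rectangle coordinates are assumptions for illustration, so check your own service’s documentation for the real response shape.

```python
import json

# Hypothetical response text. Field names and numbers are illustrative assumptions.
response_text = """
{
  "categories": [
    {
      "name": "people_portrait",
      "score": 0.98,
      "detail": {
        "celebrities": [
          {
            "name": "Gwen Ifill",
            "confidence": 0.995,
            "faceRectangle": {"left": 120, "top": 80, "width": 210, "height": 210}
          }
        ]
      }
    }
  ]
}
"""


def extract_tags(payload):
    """Flatten the nested response into (tag, confidence, face_box) rows."""
    rows = []
    for category in payload.get("categories", []):
        rows.append((category["name"], category["score"], None))
        detail = category.get("detail") or {}
        for person in detail.get("celebrities", []):
            rows.append(
                (person["name"], person["confidence"], person.get("faceRectangle"))
            )
    return rows


rows = extract_tags(json.loads(response_text))
for tag, confidence, box in rows:
    print(f"{tag}: {confidence:.1%} {box or ''}")
```

Keeping the face rectangle alongside the name lets the DAM store a region tag rather than just a keyword.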
Local-based or cloud-based
Most tagging services will be based in the cloud. These services leverage massive and ever-improving databases along with high-power cloud computers. They are able to rapidly improve because they see millions of images, and may have many users providing feedback.
Some people will not want to send their images out to external services for analysis. The images may be highly confidential, or perhaps the collection manager is just uncomfortable with letting lots of images run through external services.
There are also a number of tagging services that can run on your own computer, without having to go out to the cloud. Lightroom Classic, for instance, does its face tagging on your local computer, and does not send images through its cloud. Imagga is a commercial service that can also run on your own computer.
BLACK BOX SERVICES
In a black box service, the computational analysis is not a one-time operation. Instead, the images are continually reprocessed as the service gains new capabilities, or as it gains a better understanding of you and your collection. As the service learns, the search results should continue to improve. These services may never show you all the tags they currently store for an image since they expect to make a better set of tags at some point in the future.
An important part of black box functionality is the search capabilities inside the box. Conventional metadata is generally used in a filter operation (e.g., hide all images that don’t have the tag “Kensington, Maryland”). Black boxes can function more like Google, where misspellings, synonyms and related terms can produce results even when there is not an exact match.
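The difference between a filter and a more forgiving search can be sketched in a few lines. This is a toy illustration only: the catalog, the synonym table, and the function names are all made up, and a real black box would rely on far richer semantic data than Python’s `difflib` string matching.

```python
import difflib

# Hypothetical catalog: image id -> tags. All names here are made up.
CATALOG = {
    "img_001": {"Kensington, Maryland", "parade"},
    "img_002": {"Bethesda, Maryland", "automobile"},
    "img_003": {"Kensington, Maryland", "car"},
}

# Tiny hand-written synonym table standing in for a real semantic graph.
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}


def exact_filter(tag):
    """Conventional filter: keep only images carrying exactly this tag."""
    return sorted(i for i, tags in CATALOG.items() if tag in tags)


def forgiving_search(query, cutoff=0.8):
    """Tolerate misspellings and synonyms, roughly as a black box might."""
    vocabulary = sorted({t for tags in CATALOG.values() for t in tags})
    matches = set(difflib.get_close_matches(query, vocabulary, n=3, cutoff=cutoff))
    for match in list(matches):
        matches |= SYNONYMS.get(match, set())
    return sorted(i for i, tags in CATALOG.items() if tags & matches)


print(exact_filter("automobil"))      # the misspelling finds nothing
print(forgiving_search("automobil"))  # finds both "automobile" and "car" images
```

The exact filter is predictable and easy to audit. The forgiving search is friendlier, but you have less insight into why a given image came back.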
You don’t own or control the data
When using a black box service, the tags and other information typically reside within the service. You don’t own them. Instead, you lease access to them. This is a structural problem that is going to be hard to avoid, at least for the foreseeable future.
The best black boxes don’t just include a set of tags. They have deep semantic graphs of what a tag may mean. This is not something they can export to you, should you decide to leave the service. Likewise, the data they have about you (your search history, what you like, where you go, etc.) is probably not actionable, even if you could get a copy.
And the semantic processing they do is also going to stay within the service. (Does “ship sinks” indicate a maritime disaster, or plumbing fixture retailing?)
For some people, particularly in the consumer realm, this lack of control may be fine. For many institutions, this can be a deal-breaker.
Good for language localization
Working with multiple languages is an inherent advantage of some black box tagging services. In many cases, the semantic understanding of an image is not tied to a particular language. Google knows that “car” in French is “voiture” so it can provide similar results. (Google also knows that someone searching on “voiture” is interested in a French-based search of cars, and may be more likely to want a Citroën than a Ford.)
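One way to picture language-independent understanding is to tag images with concept IDs rather than words, and attach labels in each language to the concept. The sketch below assumes that design. The concept IDs, labels, and image IDs are hypothetical, and this is not a description of how Google or any other service actually works.

```python
# Hypothetical concept ids and labels. A real service would hold far more.
CONCEPT_LABELS = {
    "concept:car": {"en": "car", "fr": "voiture"},
    "concept:ship": {"en": "ship", "fr": "navire"},
}

# Images point at concepts, not at words in any one language.
IMAGE_CONCEPTS = {
    "img_101": {"concept:car"},
    "img_102": {"concept:ship"},
}


def search(term):
    """Find images whose concepts carry this label in any language."""
    term = term.casefold()
    wanted = {
        concept_id
        for concept_id, labels in CONCEPT_LABELS.items()
        if term in {label.casefold() for label in labels.values()}
    }
    return sorted(i for i, concepts in IMAGE_CONCEPTS.items() if concepts & wanted)


print(search("car"))      # same result as the French query below
print(search("voiture"))
```

Because the image only stores the concept, adding a new language means adding labels, not retagging the collection.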
As black box tagging services continue to improve, we’ll probably see them become particularly popular for collections that need to serve multilingual audiences.
Most black boxes ignore your tags
Most current efforts to build great black box tagging ignore the data that the user bothers to put on the photo. (The main exception seems to be person tagging, which uses your tags to help learn who individuals are.) This means that they are often ignoring the most important data in favor of more trivial information.
Most of the examples I’ve seen seem to expect that, given enough horsepower, the machine will learn everything useful, and be able to replace the human. But there is often a lot of context or backstory that is unknowable to the machine. (Why was the picture taken? Why was it uploaded?)
I think that the problem of integrating machine learning and human tagging/curation is being underestimated. (And, yes, that’s one of the things we are working hard on.)
Which way to go?
Eventually, we’re likely to get a really useful hybrid of static tags, black boxes, crowdsourcing, and human curation. But it does not really exist right now. So what is the best course of action? Here are my thoughts.
- Black boxes are great for consumers. They are less likely to make their own tags, and more likely to get a big boost from some basic machine learning optimization.
- Static tagging services are probably better for organizations. Given the early stage of computational tagging, it’s likely that services and strategy are going to evolve relatively quickly. So I don’t think it’s time to commit to any single service for the long term. That means that “owning” the tags is important. Additionally, static tagging services allow the collection manager to monitor the service, and see when new capabilities rise to the level of usefulness. Static tags also tend to integrate better with the human tagging and curation that most collections depend on.