Vision models heavily rely on fine-tuning general-purpose models pre-trained on large, static datasets. These general-purpose models only know the concepts captured in their pre-training datasets, which are tiny, out-of-date snapshots of an Internet to which billions of new images are uploaded each day.
We suggest an alternative approach: rather than hoping our static datasets transfer to our desired tasks after large-scale pre-training, we propose dynamically utilizing the Internet to quickly train a small-scale model that does extremely well on the task at hand. Our approach, called Internet Explorer, explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a desired target dataset. It cycles between searching for images on the Internet with text queries, self-supervised training on downloaded images, determining which images were useful, and prioritizing what to search for next.
We evaluate Internet Explorer across several datasets and show that it outperforms or matches CLIP oracle performance by using just a single GPU desktop to actively query the Internet for 30–40 hours.
Given unlabeled data for a target task, Internet Explorer searches the Internet to progressively find more and more relevant training data, which it learns from via self-supervised training.
We draw an analogy to robotics, where an agent takes actions in an environment and receives observations and a reward. In Internet Explorer, the Internet is the environment, text queries are our actions, images are our observations, and we define a reward that incentivizes queries that return relevant images.
Internet Explorer iteratively repeats 4 steps to find and learn from relevant Internet data. It cycles between searching for images on the Internet with text queries, self-supervised training on downloaded images, determining which images are relevant to the target dataset, and prioritizing what to search for next.
We sample text queries from a concept distribution that is updated after each iteration. This distribution is implicitly defined via our reward estimates for each concept in our vocabulary (WordNet), and is initialized uniformly. Optionally, we can use a pre-trained language model, like GPT, to help induce more visual diversity in our queries.
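For concreteness, here is a minimal sketch of how concepts could be sampled from such a distribution, assuming it is a tempered softmax over the per-concept reward estimates; the concept list and temperature are illustrative placeholders, not the actual WordNet vocabulary or the paper's exact hyperparameters:

```python
import numpy as np

# Sketch of Step 1 under the assumption that the concept distribution is a
# tempered softmax over per-concept reward estimates. The concept list below
# is a tiny illustrative stand-in for the full WordNet vocabulary.
rng = np.random.default_rng(0)
concepts = ["golden retriever", "tabby cat", "oak tree", "sports car"]
reward_estimates = np.zeros(len(concepts))  # uniform distribution at initialization

def sample_queries(reward_estimates, num_queries, temperature=1.0):
    """Sample concept indices in proportion to softmax(reward / temperature)."""
    logits = np.asarray(reward_estimates) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(reward_estimates), size=num_queries, p=probs)

# Optionally, a language model could rewrite each sampled concept into a more
# visually diverse phrase before it is sent to the search engine.
for idx in sample_queries(reward_estimates, num_queries=4):
    print(concepts[idx])
```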
We use the text queries sampled in Step 1 to query text-to-image search engines (e.g., Google, Flickr) for images. We download the top 100 images for each query and search around 256 queries per iteration. We download these ~25k images in under 5 minutes using parallelism and caching of repeated queries.
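A rough illustration of how such parallel, cached downloading might look (our own sketch, not the released code; `search_images` is a hypothetical stand-in for a real image-search client):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Sketch of Step 2: fetch the top-k results for each query concurrently, and
# cache repeated queries so each one is only issued once per run.

@lru_cache(maxsize=None)
def search_images(query: str, k: int = 100) -> tuple:
    """Return up to k image URLs for `query` (dummy URLs in this sketch)."""
    return tuple(f"https://example.com/{query.replace(' ', '_')}/{i}.jpg"
                 for i in range(k))

def download_candidates(queries, k=100, max_workers=32):
    """Issue all searches in parallel and flatten the results into one list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_query = list(pool.map(lambda q: search_images(q, k), queries))
    return [url for urls in per_query for url in urls]

urls = download_candidates(["golden retriever", "tabby cat", "golden retriever"])
print(len(urls))  # 300; the repeated query is served from the cache
```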
Next, we perform self-supervised training on the downloaded images. Any common self-supervised pretext task or algorithm can be used. We train a ResNet-50 with MoCo v3 on a combination of the target dataset, the newly downloaded "candidate" images, and a replay buffer of previously downloaded images that were deemed to be relevant training data.
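One simple way to assemble this training mixture is sketched below with PyTorch; this is an illustration of the idea, not the paper's training code, and the tensors stand in for real image folders:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Sketch of Step 3's data mixture: each iteration, a self-supervised method
# such as MoCo v3 trains on the union of the target images, the newly
# downloaded candidates, and a replay buffer of previously useful images.
def build_ssl_loader(target_ds, candidate_ds, replay_ds, batch_size=256):
    combined = ConcatDataset([target_ds, candidate_ds, replay_ds])
    return DataLoader(combined, batch_size=batch_size, shuffle=True, drop_last=True)

# Tiny random tensors stand in for real image datasets in this sketch.
target_ds = TensorDataset(torch.randn(1000, 3, 32, 32))
candidate_ds = TensorDataset(torch.randn(2000, 3, 32, 32))
replay_ds = TensorDataset(torch.randn(500, 3, 32, 32))

for (images,) in build_ssl_loader(target_ds, candidate_ds, replay_ds):
    # one self-supervised update (e.g., a MoCo v3 contrastive step) would go here
    break
```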
Finally, we update the concept distribution based on a self-supervised relevance reward. The reward for each "candidate" image is its average cosine similarity to its k-nearest neighbors in the target dataset under the current model's feature representations. We then aggregate the image-level rewards to the query level and fit a regression model using text-embedding features of the queries to predict the rewards for unseen queries in the next iteration.
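The image-level reward can be sketched in a few lines of NumPy; the feature matrices and the value of k below are placeholders, not the paper's exact settings:

```python
import numpy as np

# Sketch of the Step 4 image-level reward: each candidate's reward is its
# mean cosine similarity to its k nearest neighbors in the target dataset,
# measured in the current model's feature space.
def knn_relevance_reward(candidate_feats, target_feats, k=15):
    cand = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    targ = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    sims = cand @ targ.T                      # (num_candidates, num_targets)
    topk = np.sort(sims, axis=1)[:, -k:]      # similarities to the k nearest targets
    return topk.mean(axis=1)                  # one reward per candidate image

rng = np.random.default_rng(0)
image_rewards = knn_relevance_reward(rng.normal(size=(200, 128)),   # placeholder candidate features
                                     rng.normal(size=(500, 128)))   # placeholder target features

# Query-level rewards average the image rewards over each query's downloads;
# a regressor on text embeddings of the queries then predicts rewards for
# queries that have not yet been tried.
```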
In just 40 hours of web exploration and training on a single GPU desktop, Internet Explorer improves representation quality across several target datasets (measured via linear probe accuracy on the target dataset). Our active querying approach enables much more effective representation learning than a random-exploration baseline, which serves as a proxy for training on randomly sampled searchable Internet images. In most cases, our model even outperforms linear-probed CLIP (a strong oracle), which is remarkable as Internet Explorer uses just 2.5% as much compute and 0.5% as much data!
| Method | Birdsnap | Flowers | Food | Pets | VOC2007 | FMoW |
|---|---|---|---|---|---|---|
| Base Model | 39.9 | 94.6 | 78.3 | 85.3 | 58.0 | 48.8 |
| Random exploration | 39.6 (-0.3) | 95.3 (+0.7) | 77.0 (-1.3) | 85.6 (+0.3) | 70.2 (+12.2) | 49.3 (+0.5) |
| Internet Explorer | 62.8 (+22.9) | 99.1 (+4.5) | 84.6 (+6.3) | 90.8 (+5.5) | 79.6 (+21.6) | 50.6 (+1.8) |
| CLIP (oracle) | 57.1 | 96.0 | 86.4 | 88.4 | 86.7 | 37.5 |
@inproceedings{li2023internet,
title={Internet Explorer: Targeted Representation Learning on the Open Web},
author={Li, Alexander C and Brown, Ellis and Efros, Alexei A and Pathak, Deepak},
booktitle={International Conference on Machine Learning},
year={2023},
organization={PMLR}
}