Internet Explorer:

Targeted Representation Learning on the Open Web

Alexander C. Li¹*, Ellis Brown¹*, Alexei A. Efros², Deepak Pathak¹

¹Carnegie Mellon University, ²UC Berkeley

*Equal contribution

ICML 2023

Abstract

Vision models heavily rely on fine-tuning general-purpose models pre-trained on large, static datasets. These general-purpose models only understand knowledge within their pre-training datasets, which are tiny, out-of-date snapshots of the Internet—where billions of images are uploaded each day.

We suggest an alternate approach: rather than hoping our static datasets transfer to our desired tasks after large-scale pre-training, we propose dynamically utilizing the Internet to quickly train a small-scale model that does extremely well on the task at hand. Our approach, called Internet Explorer, explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a desired target dataset. It cycles between searching for images on the Internet with text queries, self-supervised training on downloaded images, determining which images were useful, and prioritizing what to search for next.

We evaluate Internet Explorer across several datasets and show that it outperforms or matches CLIP oracle performance by using just a single GPU desktop to actively query the Internet for 30–40 hours.

Learn a task-specific model by exploring the web

Given unlabeled data for a target task, Internet Explorer searches the Internet to progressively find more and more relevant training data via self-supervised training.

A self-supervised "online" agent

We draw an analogy to robotics, where an agent takes actions in an environment and receives observations and a reward. In Internet Explorer, the Internet is the environment, text queries are our actions, images are our observations, and we define a reward that incentivizes queries that return relevant images.
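
To make the analogy concrete, here is a minimal toy sketch of that interface. All names and the hard-coded reward are illustrative stand-ins, not the paper's actual code:

class ToyInternetEnv:
    """Toy stand-in for the Internet-as-environment analogy: the action is a
    text query, the observation is a batch of (fake) image filenames, and the
    reward is a stand-in for relevance to the target dataset."""

    def __init__(self, relevant_concepts):
        self.relevant = set(relevant_concepts)

    def step(self, query):
        images = [f"{query}_{i}.jpg" for i in range(100)]  # fake observation
        reward = 1.0 if query in self.relevant else 0.0    # fake relevance signal
        return images, reward

env = ToyInternetEnv({"golden retriever"})
images, reward = env.step("golden retriever")  # reward == 1.0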

Method Overview

Internet Explorer iteratively repeats 4 steps to find and learn from relevant Internet data. It cycles between searching for images on the Internet with text queries, self-supervised training on downloaded images, determining which images are relevant to the target dataset, and prioritizing what to search for next.

Figure: the Internet Explorer method. Step 1: sample text queries; Step 2: search the Internet; Step 3: self-supervised training; Step 4: update concept distributions.

Step 1: Sample Text Queries

We sample text queries from a concept distribution that is updated after each iteration. This distribution is implicitly defined via our reward estimates for each concept in our vocabulary (WordNet), and is initialized uniformly. Optionally, we can use a pre-trained language model, like GPT, to help induce more visual diversity in our queries.
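
As a rough sketch, sampling queries from such a distribution might look like the following. The softmax-with-temperature rule and the toy four-concept vocabulary are our assumptions; the paper's actual vocabulary is WordNet:

import numpy as np

# Toy vocabulary; the actual vocabulary is the set of WordNet concepts.
vocab = np.array(["golden retriever", "tabby cat", "sunflower", "sports car"])
reward_est = np.array([0.8, 0.6, 0.1, 0.0])  # current per-concept reward estimates

# Turn reward estimates into a sampling distribution (softmax with temperature;
# the exact transformation is an assumption, not the paper's stated rule).
tau = 0.3
probs = np.exp(reward_est / tau)
probs /= probs.sum()

rng = np.random.default_rng(0)
queries = rng.choice(vocab, size=8, p=probs)  # text queries for this iteration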

Step 2: Search the Internet

We use the text queries sampled in Step 1 to query text-to-image search engines (e.g., Google, Flickr) for images. We download the top 100 images for each query and issue around 256 queries per iteration. Using parallelism and caching of repeated queries, we download these ~25k images in under 5 minutes.
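
A hedged sketch of the parallelism-and-caching pattern is below; search_image_urls is a hypothetical stand-in for a real search engine API call:

import concurrent.futures as cf
import functools

@functools.lru_cache(maxsize=None)  # repeated queries hit the cache, not the network
def search_image_urls(query):
    # Hypothetical stand-in: a real implementation would call a search
    # engine API (e.g., Google or Flickr) and return the top-100 image URLs.
    return tuple(f"https://example.com/{query}/{i}.jpg" for i in range(100))

queries = [f"query {i % 32}" for i in range(256)]  # ~256 queries per iteration
with cf.ThreadPoolExecutor(max_workers=64) as pool:
    url_lists = list(pool.map(search_image_urls, queries))
urls = [u for url_list in url_lists for u in url_list]  # ~25k URLs to download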

Step 3: Self-supervised Training

Next, we perform self-supervised training on the downloaded images; any common self-supervised pretext task or algorithm can be used. We train a ResNet-50 with MoCo v3 on a combination of the target dataset, the newly downloaded "candidate" images, and a replay buffer of previously downloaded images that were deemed relevant.
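
For instance, combining the three image sources for training could look like this sketch. Toy tensors stand in for real images, and plain concatenation stands in for whatever mixing ratios are actually used:

import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Toy stand-ins for the three sources; real code would load image folders
# and apply MoCo v3's two-crop augmentations.
target     = TensorDataset(torch.randn(128, 3, 64, 64))
candidates = TensorDataset(torch.randn(256, 3, 64, 64))
replay     = TensorDataset(torch.randn(64, 3, 64, 64))

loader = DataLoader(ConcatDataset([target, candidates, replay]),
                    batch_size=32, shuffle=True)
for (images,) in loader:
    pass  # one self-supervised (e.g., MoCo v3 contrastive) update per batch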

Step 4: Update Concept Distributions

Finally, we update the concept distribution based on a self-supervised relevance reward. The reward for each "candidate" image is its average cosine similarity to its k nearest neighbors in the target dataset, computed in the current model's feature space. We then aggregate these image-level rewards to the query level and fit a regression model on text-embedding features of the queries to predict rewards for unseen queries in the next iteration.
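
A minimal sketch of this reward computation follows. The feature dimensions, the choice of k, the mean aggregation, and the ridge regressor are illustrative assumptions, not the paper's exact settings:

import torch
import torch.nn.functional as F
from sklearn.linear_model import Ridge

def knn_reward(cand_feats, target_feats, k=15):
    """Average cosine similarity of each candidate image to its k nearest
    neighbors in the target dataset, in the current model's feature space."""
    cand = F.normalize(cand_feats, dim=1)
    targ = F.normalize(target_feats, dim=1)
    sims = cand @ targ.T                          # pairwise cosine similarities
    return sims.topk(k, dim=1).values.mean(dim=1) # one reward per candidate

# Toy features: 256 queries x 100 images each, vs. 1,000 target images.
img_rewards = knn_reward(torch.randn(25600, 128), torch.randn(1000, 128))
query_rewards = img_rewards.view(256, 100).mean(dim=1)  # aggregate per query

# Predict rewards for unseen queries from their text embeddings (toy
# embeddings here; Ridge stands in for whatever regressor is used).
query_embs = torch.randn(256, 384).numpy()
reg = Ridge(alpha=1.0).fit(query_embs, query_rewards.numpy())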

Internet Explorer's queries improve over time

Given unlabeled data for a target task, Internet Explorer searches the Internet to find progressively more relevant training data, without any supervision.

Figures: query and retrieved-image progression over iterations for each target dataset (Pets, Flowers, Food, Birdsnap, VOC).

Improved Representation Quality

In just 40 hours of web exploration and training on a single GPU desktop, Internet Explorer improves representation quality across several target datasets (measured via linear-probe accuracy on each target dataset). Our active querying approach enables much more effective representation learning than a random-exploration baseline, a proxy for training on images returned by random queries. In most cases, our model even outperforms linear-probed CLIP (a strong oracle), which is remarkable given that Internet Explorer uses just 2.5% as much compute and 0.5% as much data!

Improved representation quality (linear-probe accuracy, %)

                    Birdsnap      Flowers       Food          Pets          VOC2007       FMoW
Base Model          39.9          94.6          78.3          85.3          58.0          48.8
Random exploration  39.6 (-0.3)   95.3 (+0.7)   77.0 (-1.3)   85.6 (+0.3)   70.2 (+12.2)  49.3 (+0.5)
Internet Explorer   62.8 (+22.9)  99.1 (+4.5)   84.6 (+6.3)   90.8 (+5.5)   79.6 (+21.6)  50.6 (+1.8)
CLIP (oracle)       57.1          96.0          86.4          88.4          86.7          37.5

BibTeX

@inproceedings{li2023internet,
    title={Internet Explorer: Targeted Representation Learning on the Open Web}, 
    author={Li, Alexander C and Brown, Ellis and Efros, Alexei A and Pathak, Deepak},
    booktitle={International Conference on Machine Learning},
    year={2023},
    organization={PMLR}
}