You’ve spent months hunting down original cartridges and discs for your collection. You’ve catalogued them, cleaned them, tested them on actual hardware. But when you sit down to organize your digital library or create a front-end display for your retro arcade setup, you realize that quality box art is scattered across a dozen websites—some reliable, some questionable, many incomplete or low-resolution. The manual process of hunting down and organizing artwork one game at a time is tedious and time-consuming. Meanwhile, you’ve heard rumors about “scraping” tools that supposedly automate this work. The question that follows is practical but loaded: How do you actually do this legally and technically?
This matters because box art is more than decoration. It’s part of preserving the complete experience of retro gaming—the visual identity, the design language of an era, the tangible connection between memory and the physical artifacts that shaped your childhood or your collecting hobby. A well-organized game library with proper artwork elevates the entire experience of your setup, whether that’s a RetroPie installation, a dedicated arcade cabinet, or simply a personal cataloguing system.
Understanding the Legal and Technical Landscape
Before you download a single image, you need to understand where box art actually comes from and what the legal reality looks like—not the internet mythology, but the actual situation.
Copyright ownership and the gray zone
Box art is copyrighted material. The publisher owns it. Nintendo owns Nintendo box art. Sega owns Sega box art. That's unambiguous. In practice, however, publishers have largely declined to enforce those copyrights against archival, personal-use, and cataloguing projects, with some notable exceptions.
The distinction that matters here is between preservation archives (like IGDB, MobyGames, and TheGamesDB) and unauthorized redistribution for profit. The former operates in a space where publishers have largely chosen not to pursue legal action because these sites provide marketing value and prevent fragmentation of game history across dead links and abandoned servers. The latter—scraping images to resell, to create commercial compilations, or to undercut official re-releases—will result in a cease-and-desist letter.
Your use case determines your risk profile. Scraping art for a personal RetroPie installation is considered fair use in most jurisdictions and is rarely if ever pursued. Scraping art to create a commercial ROM pack and selling it is copyright infringement, full stop. The middle ground—creating a tool that helps others scrape, or distributing a pre-scraped collection—exists in legal ambiguity that varies by region and publisher.
Major sources and their policies
Three platforms dominate box art availability for retro games: TheGamesDB, IGDB (Internet Game Database), and MobyGames. Each has different terms of service regarding automated access.
TheGamesDB explicitly allows scraping for personal use and provides API documentation. They ask that you cache results locally rather than hammering their servers with repeated requests, and they maintain a clear distinction between automated access (allowed with rate-limiting) and commercial redistribution (not allowed).
IGDB (owned by Twitch) requires API authentication and rate-limiting. They’re more restrictive than TheGamesDB but still permit personal use. Commercial applications and high-volume requests require explicit permission.
MobyGames permits scraping but requests that you contact them first for any systematic collection. They’re the most permissive of the three if you ask.
Beyond these, cover art databases like CoverProject and BoxArtworkDB exist specifically for archival purposes. Some are more actively maintained than others, and quality varies significantly.
The Technical Reality of Scraping
API-based methods versus web scraping
There’s a crucial distinction between two approaches: API-based data collection and web scraping. APIs are the ethical, sanctioned way. Web scraping is the gray-area workaround when APIs don’t exist or don’t provide what you need.
An API (Application Programming Interface) is a structured, intended channel for accessing data. When TheGamesDB or IGDB provide an API, they’re explicitly saying: “Here’s the proper way to access our data in bulk. Use this, follow these rate limits, and you’re good.” Using an API respects the server architecture, doesn’t disguise your requests as human users, and doesn’t surprise the database owner with unexpected traffic.
Web scraping, by contrast, automates the process of visiting web pages the way a human would, parsing the HTML, extracting images and metadata, and storing them locally. It’s faster to implement because you don’t need API documentation. It’s also more fragile (any change to the website’s HTML structure breaks your scraper), more aggressive on servers, and easier to interpret as hostile.
The reason this distinction matters technically: APIs are designed to handle programmatic requests. They return structured data (usually JSON) that’s easy for your script to parse. Websites aren’t. You have to parse HTML with regex or DOM parsers, handle pagination, deal with JavaScript-rendered content, and work around rate-limiting. A properly designed API returns 100 results in 50 kilobytes. A web scraper might fetch 2 megabytes of HTML boilerplate to extract the same information.
Rate-limiting and server load
Here’s where the technical meets the ethical. When you run a scraper, every request is a request to someone’s server. If your script makes 100 requests per second, you’re consuming measurable server resources. If you’re the only person doing it, it’s noise. If 10,000 people use the same tool, that database goes down.
Rate-limiting solves this. It means: add a delay between requests (typically 1-2 seconds per request for courtesy; the API documentation will specify minimums). Identify yourself with a User-Agent string that includes who you are (not a generic browser), so the server owner can contact you if needed. Cache results locally so you don’t request the same image twice. Use compression where available.
APIs typically enforce rate-limiting automatically—they’ll return error 429 (Too Many Requests) if you exceed limits. Good scraper design implements rate-limiting before the API does, so you never hit that error in the first place. It’s self-regulating courtesy.
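That self-regulating courtesy can be sketched in a few lines of Python using only the standard library. The User-Agent string and retry counts below are illustrative assumptions, not values any particular API mandates:

```python
import time
import urllib.request
import urllib.error

# Identify yourself so the server owner can reach you (illustrative value)
USER_AGENT = "my-boxart-scraper/1.0 (contact: you@example.com)"

def backoff_delay(attempt, base=1.0):
    # Exponential backoff: 1s, 2s, 4s, ... for attempts 0, 1, 2, ...
    return base * (2 ** attempt)

def polite_get(url, max_retries=3):
    """GET with a self-identifying User-Agent, backing off on 429."""
    for attempt in range(max_retries):
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            return urllib.request.urlopen(req, timeout=10).read()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise
            # Honor Retry-After if the server sends it, else back off exponentially
            wait = float(err.headers.get("Retry-After") or backoff_delay(attempt))
            time.sleep(wait)
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts: {url}")
```

The same pattern translates directly to the requests library if you prefer it; the important parts are the honest User-Agent, the delay, and treating 429 as a signal to slow down rather than an error to retry immediately.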
Practical Scraping Workflows
Method 1: Using TheGamesDB API (recommended for most users)
TheGamesDB is the most accessible option because it has clear API documentation and explicitly permits personal use scraping. Here’s how to actually do it.
Step 1: Get an API key. Visit thegamesdb.net/api, create a free account, and generate an API key. This is free and takes two minutes.
Step 2: Structure your request. TheGamesDB uses REST endpoints. A basic search for a game looks like this:
https://api.thegamesdb.net/v1/Games/ByGameName?apikey=YOUR_KEY&name=Super%20Mario%20Bros&include=boxart
That returns a JSON response with metadata, including URLs to box art images hosted on their CDN. The response looks something like:
{"data": {"games": [{"id": 1234, "game_title": "Super Mario Bros", "release_date": "1985-01-01", "images": [...]}]}}
Step 3: Parse and download. Use a language like Python with the requests library to fetch that JSON, parse it (requests' built-in .json() method handles this), extract the image URL, and download the image with another HTTP request. Store everything in a local folder organized by console and game name.
Here’s a minimal Python example that demonstrates the concept:
import requests
import time
import os
from urllib.parse import quote

api_key = "YOUR_API_KEY"
base_url = "https://api.thegamesdb.net/v1"
output_dir = "./game_art"

def search_game(game_name):
    # URL-encode the title so spaces and punctuation survive the query string
    url = f"{base_url}/Games/ByGameName?apikey={api_key}&name={quote(game_name)}&include=boxart"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()

def download_image(url, filename):
    os.makedirs(output_dir, exist_ok=True)
    img_response = requests.get(url, timeout=10)
    img_response.raise_for_status()
    with open(os.path.join(output_dir, filename), "wb") as f:
        f.write(img_response.content)

game_data = search_game("Super Mario Bros")
for game in game_data.get("data", {}).get("games", []):
    for image in game.get("images", []):
        download_image(image["url"], f"{game['game_title']}.jpg")
        time.sleep(1)  # Rate limiting: one second between requests
The key point: time.sleep(1) enforces a one-second delay between requests. That’s the courtesy built in.
Why this method works: You’re using the intended API, respecting rate limits, downloading once and caching locally, and following the terms of service. This is the right way to do it.
Method 2: IGDB API for modern games and richer metadata
IGDB (owned by Twitch) has better artwork for newer titles and richer metadata (release dates, genres, platforms, developer info). The trade-off is slightly stricter rate limits and the need for OAuth authentication.
To use IGDB, you need to set up a Twitch application, get a client ID and OAuth token, and include those in your request headers. The process is documented at api.igdb.com, but it’s more involved than TheGamesDB. You’ll need to add headers like:
Client-ID: YOUR_CLIENT_ID
Authorization: Bearer YOUR_ACCESS_TOKEN
Then query the v4 endpoints. Note that IGDB v4 expects a POST request with the query in the request body (its "Apicalypse" syntax), not URL parameters:

POST https://api.igdb.com/v4/games
fields name,cover.url; search "Super Mario Bros"; limit 10;
IGDB serves images through a CDN that requires some URL manipulation to get specific sizes and formats. The documentation is thorough, and the API is rock-solid. Use this if you’re building something more sophisticated than a simple personal collection.
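As a sketch of that request flow, assuming IGDB v4's convention of POSTing an Apicalypse query in the body (the helper names here are my own, and the standard library stands in for requests):

```python
import json
import urllib.request

def apicalypse_query(fields, search, limit=10):
    # IGDB v4 takes its query in the POST body, not in URL parameters
    return f'fields {",".join(fields)}; search "{search}"; limit {limit};'

def igdb_search(client_id, access_token, search):
    """POST an Apicalypse query to IGDB and return the parsed JSON response."""
    body = apicalypse_query(["name", "cover.url"], search)
    req = urllib.request.Request(
        "https://api.igdb.com/v4/games",
        data=body.encode("utf-8"),
        headers={
            "Client-ID": client_id,
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

The query-building step is deliberately separated out: it's the part most likely to grow as you add field filters, and it's trivially testable without touching the network.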
Method 3: Manual sourcing with bulk download tools
If you want to avoid scripting entirely, tools like LaunchBox (Windows) and EmulationStation (cross-platform) have built-in image scrapers that integrate with TheGamesDB and other sources. You load your game list, run the scraper tool, and it automatically fetches and organizes artwork. These tools handle rate-limiting, deduplication, and local caching for you.
This is the most user-friendly approach if you’re not comfortable with command-line tools or coding. LaunchBox is particularly robust; it can scrape thousands of games overnight without breaking anything.
Quality Control and Metadata Matching
The matching problem
Here’s a practical issue that the technical explanations gloss over: game titles aren’t unique. There are multiple versions of The Legend of Zelda across different platforms. There are licensed games released under slightly different names in different regions. Your game list might say “Mario Bros.” but the database has “Super Mario Bros.” Automatic matching is necessary but imperfect.
Most scrapers use fuzzy string matching (algorithms that find “close enough” title matches) combined with platform/year filtering. So the logic is: “Find a game with a title similar to ‘Mario Bros’ released in 1985 on the NES.” That works 85% of the time. The remaining 15% require manual verification or correction.
Here’s how to handle this:
1. Review results before committing. Don’t let a scraper run unattended and blindly accept every match. Spot-check the first 20-50 games to ensure the tool is matching correctly. If accuracy drops below 90%, you’ll need to adjust your search parameters or do manual corrections.
2. Use multiple fields for matching. Don’t match on title alone. Include platform, year, and developer if available. TheGamesDB includes all three in API responses, so use them.
3. Create a manual correction list. For the handful of games that won’t match automatically, keep a CSV or JSON file with corrected titles, platform mappings, and database IDs. Run the scraper, let it populate 90%, then manually fill in the remaining 10%.
4. Verify image quality before organizing. Downloaded images vary in resolution and quality. Check a sample before committing the full batch. If you’re seeing compressed, upscaled, or incorrect artwork, adjust your source.
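The fuzzy title matching described above can be sketched with Python's standard difflib module. The 0.6 similarity cutoff is an illustrative starting point, not a value any particular scraper uses:

```python
import difflib

def best_match(title, candidates, cutoff=0.6):
    """Return the closest candidate title, or None if nothing is close enough."""
    # Compare case-insensitively, but return the candidate's original spelling
    lowered = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(title.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

# In practice, filter candidates by platform and year first, then fuzzy-match
# the title only within that narrowed set.
```

Raising the cutoff reduces false matches at the cost of more unmatched games you'll fix by hand; tune it against a spot-checked sample of your own library.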
Image formats and optimization
Box art comes in different formats: JPG, PNG, WebP. JPG is most common and smallest. PNG preserves quality but is larger. For retro game organization, JPG at 80-85% quality is a reasonable balance—you’re displaying cover art on a screen, not printing posters, so pixel-perfect preservation isn’t necessary.
If you’re building this for a frontend like RetroPie or EmulationStation, common box art dimensions are roughly 500×700 pixels (portrait, typical box art) or 1000×600 (landscape, banner art), though exact expectations vary by theme. Sources vary, so you may need to resize. Use ImageMagick or FFmpeg in batch mode if you’re processing thousands of images.
A practical workflow: Download full-resolution from the source, store locally, generate thumbnails for the frontend. That way you have the original data preserved while keeping your interface responsive.
Building a Sustainable Long-Term Scraping System
Incremental updates and change detection
Once you’ve scraped artwork for your collection, you’re not done. New games release. Databases update metadata. Better-quality scans become available. Running a fresh scrape every few months keeps your library current without re-downloading images you already have.
Efficient updates use change detection: query the database for games updated since your last scrape, download only new or modified artwork, skip everything else. This reduces bandwidth and API calls significantly.
Implement this by storing a timestamp of your last successful scrape, then in subsequent runs, only query results modified after that date. Most databases support filtering by modification date in their API.
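A minimal sketch of that bookkeeping follows. The state-file name is my own convention, and you should check your database's API documentation for its actual "modified since" filter parameter:

```python
import json
import os
from datetime import datetime, timezone

def last_run(state_file="./scrape_state.json"):
    """Return the ISO timestamp of the previous scrape, or None on the first run."""
    if not os.path.exists(state_file):
        return None
    with open(state_file) as f:
        return json.load(f)["last_run"]

def record_run(state_file="./scrape_state.json"):
    """Persist the current UTC time so the next run can query only newer changes."""
    with open(state_file, "w") as f:
        json.dump({"last_run": datetime.now(timezone.utc).isoformat()}, f)

# Typical loop: since = last_run(); query the API for entries modified after
# `since` (or everything, if None); download; then record_run() on success.
```

Recording the timestamp only after a successful run matters: if the scrape dies halfway, the next run re-covers the same window rather than silently skipping it.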
Error handling and resilience
Network requests fail. Servers go down. Images disappear. URLs rot. A robust scraper handles these gracefully:
- Timeout handling: Set connection timeouts (e.g., 10 seconds). If a request hangs, move on rather than stalling the entire script.
- HTTP status code checking: Verify that the server returned a 200 (success) response before processing the data. 404 means the image is gone. 429 means you’re being rate-limited. Handle each appropriately.
- Retry logic: For transient failures (500 errors, timeouts), retry 2-3 times with exponential backoff (wait 1 second, then 2, then 4) before giving up.
- Logging: Write detailed logs of what you downloaded, what failed, and why. When you have 5,000 games and 50 failed, logs tell you which ones to investigate manually.
- Deduplication: Don’t re-download the same image. Hash the downloaded file, compare against your local cache, skip if it’s already there.
These aren’t fancy techniques. They’re reliability engineering basics. A production scraper—meaning one that runs reliably for years—includes all of them.
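Two of those basics, retries with exponential backoff and hash-based deduplication, sketched in isolation (the function names are my own):

```python
import hashlib
import time

def with_retries(fetch, retries=3, base_delay=1.0):
    """Call fetch(); on failure, retry with exponential backoff before giving up."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # exhausted retries; let the caller log and move on
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

def store_if_new(data, path, seen_hashes):
    """Write data to path unless a byte-identical file was already stored."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen_hashes:
        return False  # exact duplicate; skip
    seen_hashes.add(digest)
    with open(path, "wb") as f:
        f.write(data)
    return True
```

Wrapping each download in with_retries and routing every write through store_if_new gets you most of the resilience described above with very little code.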
Metadata Beyond Images: When You Need More Than Box Art
Once you’ve solved the box art problem, you realize you also want game descriptions, release dates, developer info, ratings, gameplay screenshots, and maybe video trailers. This expands the scraping scope significantly.
TheGamesDB includes metadata in the same API responses as images. You can request release dates, genres, publishers, ratings, and brief descriptions. Store this alongside the artwork. The extra data is minimal (mostly strings and integers) and provides context that enriches your game library.
IGDB provides richer metadata but charges for high-volume access above the free tier. If you’re running a personal project, the free tier handles thousands of games. If you’re building a service that serves other users, you’ll hit the limits and need to negotiate with IGDB or find an alternative.
For gameplay screenshots and promotional video, the sources are more scattered. YouTube has trailers (which you shouldn’t download without explicit permission, though watching them for context is fine). Some games have promotional screenshots in databases, others don’t. Accept that you won’t find media-rich coverage for every game, especially for obscure or region-specific releases.
Avoiding Common Mistakes
Mistake 1: Not understanding your data source’s terms of service
Read the actual terms before you start scraping. Seriously. A five-minute read of TheGamesDB’s or IGDB’s terms of service prevents hours of frustration or legal issues. Know what you’re allowed to do with the data, what rate limits apply, and what constitutes misuse.
Mistake 2: Overloading the database with aggressive scraping
A script that makes requests as fast as your internet connection allows will be throttled, IP-banned, or worse. Rate-limiting is not optional politeness; it’s the difference between your scraper working reliably for years and being blocked after a week. Enforce delays between requests. Identify your requests with a User-Agent. Respect 429 responses immediately by backing off.
Mistake 3: Assuming URLs are permanent
Image URLs from databases sometimes change (servers get reorganized, CDNs are replaced, old formats deprecated). Don’t store only the URL and expect to fetch it months later. Download the image, verify it’s valid (correct file size, not corrupted), and store it locally. Keep the source URL as metadata in case you need to re-fetch, but don’t rely on it.
Mistake 4: Mixing different data sources without deduplication
If you’re pulling from both TheGamesDB and IGDB, you might end up with duplicate entries for the same game under slightly different titles. Use game IDs from the source databases for cross-referencing, or manually deduplicate by matching on platform + year + title. Otherwise, your final collection has redundancy and confusion.
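One way to sketch that platform + year + title cross-referencing key (the normalization rules here are illustrative):

```python
import re

def dedup_key(title, platform, year):
    """Normalize (title, platform, year) so the same game pulled from two
    databases collapses to one key despite punctuation or case differences."""
    norm_title = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    return (norm_title, platform.strip().lower(), int(year))
```

Keying a dictionary of downloaded games on dedup_key makes cross-source duplicates collide instead of accumulating; prefer the source database's own game IDs when you have them, and fall back to this key when you don't.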
Mistake 5: Not preserving the original source URL or attribution
Good practice: keep metadata indicating where each image came from (source database, image ID, URL, date retrieved). If the image is later contested or you need to re-fetch or verify it, you have the trail. It’s also ethical attribution—if you’re distributing your scraped collection, you want to acknowledge the source.
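A sketch of such a provenance record, written as a JSON sidecar next to each image (the field names are my own convention, not any tool's format):

```python
import json
from datetime import datetime, timezone

def provenance_record(source, image_id, url):
    """Describe where an image came from and when it was retrieved."""
    return {
        "source": source,          # e.g. "TheGamesDB"
        "image_id": image_id,      # the source database's ID for the image
        "url": url,                # original URL, in case a re-fetch is needed
        "retrieved": datetime.now(timezone.utc).isoformat(),
    }

def write_sidecar(image_path, record):
    # "Super Mario Bros.jpg" gets "Super Mario Bros.jpg.json" beside it
    with open(image_path + ".json", "w") as f:
        json.dump(record, f, indent=2)
```

A sidecar file per image keeps the trail attached to the artwork even when files are copied between machines, which a central database entry wouldn't survive.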
When Scraping Isn’t the Right Answer
Curated, licensed collections
Some publishers now release official collections with licensing agreements that include artwork. If you’re serious about preservation and don’t mind spending $20-50, these are cleaner than scraping. Nintendo Switch Online, for instance, includes a curated library with official artwork and metadata. You can’t scrape it, but you also don’t need to—it’s already organized.
Similarly, if you’re using a dedicated device like a TurboGrafx-16 mini or Sega Genesis mini, the artwork is pre-loaded and correct. The effort to scrape and organize is only necessary if you’re building a custom frontend or maintaining your own collection system.
Very old or very obscure games
Pre-1980 arcade games, unreleased prototypes, regional variants, and homebrew games often have no database entry. You can’t scrape what doesn’t exist. For these, you’re hunting manually—contacting collectors, searching archive sites, or commissioning artwork from designers. It’s not scalable, but for a small fraction of your collection, it’s acceptable.
When a commercial tool does it better
Tools like LaunchBox include intelligent scrapers that handle matching, deduplication, and organization automatically. If you’re building a RetroPie installation or a dedicated arcade cabinet and you don’t want to script anything, buying a $40 tool and running it for an hour is faster than writing and debugging a scraper. The cost/time trade-off depends on your project scope.
Legal Considerations and Risk Assessment
Let’s be explicit about the legal risk profile, because there’s a lot of misunderstanding here.
Personal use, non-commercial: Low risk. Scraping artwork to display in a private RetroPie installation or a personal cataloging system is considered fair use in most jurisdictions. Publishers don’t pursue this because it’s uneconomical and bad PR. There is no well-known case of legal action being taken against someone for organizing their own game collection with artwork.
Distribution of scraped collections: Medium-to-high risk. Creating a “complete box art pack” and distributing it—even for free—treads into copyright infringement territory. You’re redistributing copyrighted material without permission. Publishers could send cease-and-desist letters. Whether they actually do depends on visibility and perceived harm, but the legal risk is real.
Commercial use: Very high risk. Using scraped artwork to create a commercial service, sell a collection, or include it in a product you’re monetizing is straightforward copyright infringement. Expect legal action.
Respecting terms of service: The APIs we’ve discussed (TheGamesDB, IGDB) specifically permit personal use scraping. As long as you’re following their rate limits and not redistributing their data commercially, you’re within their terms. This doesn’t eliminate copyright risk (the artwork itself is copyrighted regardless of API permissions), but it eliminates the risk of the database itself taking action against you.
The safest approach: scrape for personal use, rate-limit responsibly, don’t redistribute, and keep it private. That’s where the vast majority of personal projects live, and that’s where copyright holders don’t pursue enforcement.
Putting It All Together: A Practical Example
Let’s say you have 200 NES games and you want to scrape box art for a RetroPie installation. Here’s the actual workflow:
Step 1: Export your game list as a CSV or text file with one game name per line. Something like: Super Mario Bros., The Legend of Zelda, Metroid, Castlevania, etc.
Step 2: Set up a Python environment with the requests library installed. Create the script described earlier (or download a pre-made scraper from GitHub if you trust the source). Configure your TheGamesDB API key.
Step 3: Run the scraper on your game list. It queries TheGamesDB for each game, downloads artwork to a local folder organized by game name, and logs results. This takes 3-5 minutes for 200 games (due to rate-limiting delays).
Step 4: Review the results. Check the log for any games that didn’t match. Spot-check 10-20 images to verify they’re correct and good quality.
Step 5: For the handful of games that didn’t match or matched incorrectly, manually search TheGamesDB and download the correct artwork by hand. This is tedious but necessary—it takes 30 minutes for 20 games.
Step 6: Copy your artwork folder to your RetroPie installation (or wherever your game frontend expects to find images). Configure your frontend (EmulationStation, LaunchBox, etc.) to scan that folder and match images to games.
Step 7: Test. Load a few games and verify the artwork displays correctly. Adjust image naming or folder structure if needed.
Total time investment: 4-6 hours for a 200-game library. That’s reasonable. If you had 5,000 games, it scales to maybe 20-30 hours total (most of which is automated), but the manual review and correction increases proportionally.
The Bigger Picture: Why This Matters
Box art scraping is a practical problem, but it connects to larger questions about digital preservation, copyright in the internet age, and how we maintain the cultural artifacts of video games. Box art is part of gaming history. It reflects design trends, marketing strategies, regional preferences, and the evolution of game packaging from cartridge era to disc era to digital distribution.
By maintaining organized, preserved copies of this artwork—legally and ethically—you’re participating in informal but vital archival work. You’re ensuring that the visual identity of these games survives even if original publishers decide to delist or discontinue re-releases. You’re building a library that captures not just the games, but their context.
That’s worth doing right.