<aside> <img src="/icons/book-closed_blue.svg" alt="/icons/book-closed_blue.svg" width="40px" /> MuckRock’s User Guide gives you everything to make the most out of MuckRock’s suite of tools. If this has helped you, consider donating to support our work!

</aside>

DocumentCloud’s Scraper Add-On lets you scrape and optionally crawl a given website for documents to upload to DocumentCloud.

<aside> <img src="/icons/electric-plug_gray.svg" alt="/icons/electric-plug_gray.svg" width="40px" /> Add-Ons give DocumentCloud superpowers. Learn more about Add-Ons.

</aside>

You can track all Add-On runs on DocumentCloud. The Scraper Add-On will try to run five times before it gives up and lets you know that it has failed, and your Add-On schedule and history will include the details of any error message.

When the Scraper Add-On fails, you should receive an email with more detail about the error that the Add-On encountered.

This list is not comprehensive — there are a lot of reasons the Add-On might fail to run, but these are the most common:

  1. You’re not verified to upload documents on DocumentCloud

    The Scraper Add-On requires that you be able to upload documents on DocumentCloud, which is only available to Verified users.

  2. User agent blocking

    Site admins frequently prevent programs from accessing a site by filtering certain user agents. When you browse a website in your own web browser, a user agent string identifying that browser is attached to every network request sent to the site. Many Python programs use the default Python user agent, which is commonly blocked by site admins. The DocumentCloud Scraper Add-On identifies itself to sites with a custom user agent, so it will pass a rudimentary filter for the Python user agent, but that does not protect it from more advanced user agent blocking or from being added to a block list at a later time. A quick way to test whether a site filters by user agent is shown in the first sketch after this list.

    Example: https://wwww.septa.org/procurement/bids/?bid_category=with-eps&bid_status=closed (Note: we added a unique identifying user agent to the Scraper Add-On, so this site now works, but it does filter out the Python user agent and was previously failing.)

  3. IP & Geo-blocking

    Administrators often block certain IP addresses, or users from entire geographic regions, from accessing a site. Because DocumentCloud Add-Ons run on GitHub Actions, which are allocated their own IP addresses, we cannot control which IP address is used or whether a site will block the Add-On from scraping in the future. We discovered that the website below is completely unavailable in certain parts of the world. The second sketch after this list shows a quick check you can run from your own connection.

    Example: https://protectthevote.com/legal_activities/

  4. Dynamic sites

    Sites that rely heavily on JavaScript don’t load all of their content until some user interaction happens: a click, scroll, search, etc. The DocumentCloud Scraper Add-On relies on the Python requests library, which is unable to handle these sorts of cases. You can often tell you are looking at a dynamic page because the URL does not change when you interact with the site and new items get loaded. Scraping these kinds of sites usually requires more advanced tools that involve programming, like Selenium or Playwright; see the third sketch after this list.

    Example: https://www.courts.michigan.gov/case-search/

  5. CAPTCHA

    Often combined with other anti-scraping design choices, CAPTCHAs are used by site administrators to minimize scraping activity. The Scraper Add-On does not support sites that use CAPTCHAs.

    Example: https://cases.ra.kroll.com/puertorico/Home-DocketInfo
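
If you suspect user agent filtering (reason 2 above), a minimal sketch like the one below, using the same requests library the Add-On relies on, can show whether a site treats the default Python user agent differently from a browser-style one. The URL and the browser-style header string here are placeholders, not the Add-On’s actual user agent.

```python
import requests

URL = "https://example.com/documents"  # placeholder: the page you want to scrape

# Fetch once with requests' default Python user agent and once with a
# browser-style user agent, then compare how the site responds.
default = requests.get(URL, timeout=30)
browser_like = requests.get(
    URL,
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"},
    timeout=30,
)

print("default user agent:      ", default.status_code)
print("browser-style user agent:", browser_like.status_code)
# A refusal (e.g. 403) for the default request alongside a 200 for the
# browser-style one suggests the site filters on the User-Agent header.
```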

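For suspected IP or geo-blocking (reason 3), a quick check from your own connection can help: if the request below succeeds for you but the Add-On still fails, a block on GitHub Actions’ IP ranges is a likely cause. The URL is a placeholder, and the status-code interpretations are common conventions, not guarantees.

```python
import requests

URL = "https://example.com/legal_activities/"  # placeholder

try:
    response = requests.get(URL, timeout=30)
    # 403 often indicates an IP-based block; 451 ("Unavailable For Legal
    # Reasons") is sometimes used for geographic blocks.
    print("status:", response.status_code)
except requests.exceptions.Timeout:
    print("request timed out (some blocks silently drop traffic)")
except requests.exceptions.ConnectionError:
    print("connection refused or reset")
```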

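For dynamic sites (reason 4), the Scraper Add-On itself won’t execute JavaScript, but if you want to scrape such a page on your own, a headless browser can render it first. Below is a minimal Playwright sketch; the URL and the CSS selector are hypothetical and would need to match the page you’re targeting.

```python
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

URL = "https://example.com/case-search/"  # placeholder for a JavaScript-heavy page

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(URL)
    # Wait for the JavaScript-rendered results to appear; a plain
    # requests.get() would only ever see the initial HTML shell.
    page.wait_for_selector("table.results")  # hypothetical selector
    html = page.content()
    browser.close()

print(len(html), "bytes of rendered HTML")
```
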
If you need further assistance in evaluating or troubleshooting the Scraper Add-On, connect with us on Slack or contact us at [email protected].

<aside> <img src="/icons/chat_green.svg" alt="/icons/chat_green.svg" width="40px" />

Could this guide be more clear? Submit feedback to help us improve our documentation!

</aside>