<aside> <img src="/icons/book-closed_blue.svg" alt="/icons/book-closed_blue.svg" width="40px" /> MuckRock’s User Guide gives you everything to make the most out of MuckRock’s suite of tools. If this has helped you, consider donating to support our work!
</aside>
A couple of weeks ago, a DocumentCloud power user wrote in seeking to scrape the USCIS reading room.
The reading room contains over 1,700 documents spread across multiple pages. Even with the drop-down set to show 100 documents per page, that is 18 pages’ worth of content.
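The page count is just a ceiling division. Here is a minimal sketch of that arithmetic; `pages_needed` is a hypothetical helper name, and 1,701 stands in for "over 1,700" as an illustrative lower bound rather than the real document count.

```python
import math


def pages_needed(total_documents: int, per_page: int) -> int:
    """Number of listing pages required to show every document."""
    return math.ceil(total_documents / per_page)


# 1,701 is an assumed stand-in for "over 1,700" documents.
print(pages_needed(1701, 100))  # 18
```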
We could feed each of these 18 pages into the Scraper Add-On, or we could write a custom scraper in something like Selenium to click through each page and collect the documents as it goes.
Both of these approaches left something to be desired. And after the initial scrape, it would be nice to keep on top of newly posted documents.
I noticed an RSS feed at the bottom of the page. Although the RSS feed does not go back in time and capture all of the older documents, it does keep up with documents newly posted to the reading room. In fact, a new document appeared on the feed within the last month, on October 31st.
Our wonderful first-round Gateway Grantee Jeremy Singer-Vine of the Data Liberation Project wrote an Add-On called RSS Document Fetcher. You provide a link to an RSS feed, and the Add-On automatically fetches the documents and uploads them to DocumentCloud. You can schedule the fetcher to run hourly, daily, or weekly to keep track of the feed.
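To give a feel for what "fetching documents from a feed" involves, here is a minimal sketch that pulls the item links out of an RSS 2.0 feed using only Python's standard library. It is not the Add-On's actual code. The sample feed, the `example.org` URLs, and the `extract_item_links` helper are all made up for illustration, and a real feed may put document URLs in other elements, such as `enclosure`.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed; the titles and URLs are placeholders.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example reading room</title>
    <item>
      <title>Document A</title>
      <link>https://example.org/docs/a.pdf</link>
    </item>
    <item>
      <title>Document B</title>
      <link>https://example.org/docs/b.pdf</link>
    </item>
  </channel>
</rss>"""


def extract_item_links(feed_xml: str) -> list[str]:
    """Return the <link> text of every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:  # skip items that have no link element or an empty one
            links.append(link.strip())
    return links


for url in extract_item_links(SAMPLE_FEED):
    print(url)
```

In practice you would download the feed first and pass its text to the helper. The Add-On handles that step, plus the upload to DocumentCloud, for you.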
First, log into DocumentCloud and open the RSS Document Fetcher Add-On.
The Add-On run fields are pretty straightforward.
Provide a link to an RSS/Atom feed, an access level for the documents you fetch, a source for the documents, the title or ID of the project where you’d like the documents stored, and a feed name, which is just a way for you to keep track of your fetcher. Optionally, you can provide a Slack webhook to receive Slack notifications when new documents are fetched, and you can check the “Notify on new documents” box to be notified by email.
Then, choose how often the Add-On should run and click dispatch.
After you click dispatch, the Add-On runs once immediately and picks up the first ten new documents it sees. It then runs at the frequency you set, picking up as many as ten new documents on each run, if there are any.
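The ten-documents-per-run behavior can be pictured with a small simulation. This is a sketch under assumptions, not the Add-On's implementation: I am assuming the fetcher remembers which links it has already handled (the `seen` set), and `select_new_documents` and the `example.org` URLs are hypothetical.

```python
def select_new_documents(feed_links, seen, batch_size=10):
    """Return up to batch_size links not yet in seen, and record them as seen."""
    new_links = []
    for link in feed_links:
        if link in seen or link in new_links:
            continue  # already fetched on an earlier run, or a duplicate entry
        new_links.append(link)
        if len(new_links) == batch_size:
            break
    seen.update(new_links)
    return new_links


# Simulate a feed that lists 25 documents, polled on four scheduled runs.
feed = [f"https://example.org/docs/{n}.pdf" for n in range(25)]
seen = set()
for run in range(1, 5):
    batch = select_new_documents(feed, seen)
    print(f"run {run}: {len(batch)} new documents")
```

With 25 documents on the feed, the four runs pick up 10, 10, 5, and 0 documents. A backlog is worked through over several scheduled runs rather than all at once.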
I’ve stored the results of running the RSS Document Fetcher on the USCIS reading room in a public project, which you can explore.
<aside> <img src="/icons/chat_green.svg" alt="/icons/chat_green.svg" width="40px" />
Could this guide be more clear? Submit feedback to help us improve our documentation!
</aside>