Twarc: Learning to Extract & Understand Twitter Data

I’m currently working on a research project to document and explore the Twitter resistance movement that has formed in the wake of the Trump administration. To save, explore, and visualize the vast number of tweets that have appeared under various accounts and hashtags during this time, I decided to turn to Twarc.

Developed by Ed Summers for the Documenting the Now project, Twarc is a command line tool and Python library for archiving Twitter JSON data. Using the Twitter API, users can collect tweets, hashtags, trends, followers, friends, retweets, replies… basically, anything publicly available on Twitter can be requested here. Using various commands, you can even set up collections of certain hashtags and accounts to track trending information. Twarc is one of four primary tools developed by Documenting the Now to work with Twitter data, each requiring a different level of technical proficiency. Twarc, like these other tools, reflects an effort to chronicle historically significant events and consider ethical ways of working with social media content. Pitched from a mindset geared towards “archivists of the future,” Twarc offers a way to think about collecting and archiving Twitter data in forms that prioritize context, safety, and usability. Though other tools exist for this kind of work, Twarc seems best prepared to handle long-term curation and expansive requests for Twitter data. In addition, the DocNow team behind it champions many of the questions around social media activism that I want to place in conversation with my aims for this project.

I hadn’t used Twarc prior to this project. But I’d seen a lot of people talking about it, and I had worked with a lot of the other tools that help Twarc function. Users operate Twarc through the command line interface. Though I won’t go into the details of the command line here, it’s particularly helpful for research on large datasets, because text-based commands let you specify exactly what you want to request. Twarc is also a Python library, which means its functions can be called from scripts written in Python to make requests to the Twitter API.
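To give a sense of what working with the collected data in Python can look like, here is a minimal sketch that counts hashtags in a set of archived tweets. It doesn’t call Twarc itself. It assumes the tweets have already been saved as line-oriented JSON (one tweet per line) and that each tweet stores its hashtags under `entities.hashtags`, as in Twitter’s v1.1 tweet format. The function name `count_hashtags` and the two sample tweets are made up for illustration.

```python
import json
from collections import Counter
from io import StringIO


def count_hashtags(lines):
    """Count hashtags across an iterable of JSON-encoded tweets, one per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # Assumption: v1.1-style tweets keep hashtags under entities.hashtags[].text
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts


# Two made-up tweets standing in for a file of collected data
sample = StringIO(
    '{"id_str": "1", "entities": {"hashtags": [{"text": "Resist"}]}}\n'
    '{"id_str": "2", "entities": {"hashtags": [{"text": "resist"}, {"text": "Archive"}]}}\n'
)
print(count_hashtags(sample).most_common())
```

With a real archive, the same function could be handed an open file instead of the sample, for example `count_hashtags(open("tweets.jsonl"))`, where `tweets.jsonl` is a hypothetical filename. Lowercasing the tags is a choice I made here so that “Resist” and “resist” are counted together.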

My experience with both the command line and Python is limited, even after doing this assignment. I mostly know how to follow instructions and how-to guides. Documenting the Now’s GitHub page provides precise instructions for operating the tool, starting with installation and following through with the utilities and filters for the various metadata collected. And there are plenty of opportunities to build on Python as a language: commands are fairly easy to read with little context, as readability is key to the code’s success. I ran into a few issues in getting Python running and operating correctly, and I’m not entirely sure I could replicate the process without outside assistance. (I broke the cardinal rule: save your Terminal shell!) But while there is a learning curve to making the most of tools like Twarc, it’s easy enough to follow along with the vast number of guides available. Hopefully, my next run-through will go a little more smoothly!