Research Datasets

Twitter Data

Data Description

The Twitter data were collected by the Geoinformation and Big Data Research Lab at the Center for GIScience and Geospatial Big Data (CeGIS) for academic research purposes. This is a live dataset of worldwide tweets spanning more than 10 years, from 2012 to the present; real-time tweets are collected around the clock. As of October 2022, the dataset held about 18.6 billion tweets. All tweets have been cleaned and converted to CSV files, with one row per tweet.

Specifically, the database contains two types of tweets: geotagged tweets and randomly sampled tweets. The geotagged tweets have been collected continuously from 2015 to the present using the official Twitter Streaming Application Programming Interface (API) [1] with a geographic filter (about 8 billion as of December 2022). A “geotagged tweet” is a tweet embedded with a geolocation shared by the Twitter user. The locational accuracy of a geotagged tweet depends on how the user shares their location when posting. The location can be shared as a place name (e.g., country, state, city, neighborhood) or as an exact latitude and longitude (determined by the device’s GPS or other signals such as cell towers). Our analysis of 2019 global Twitter data indicates that most tweets (79%) were geotagged at the city/county level, followed by the first-level subdivision such as a state or province (9.8%), exact coordinates (6.4%), the country level (3.3%), and a neighborhood or point of interest such as a park or a store (1.5%) [2]. Note that Twitter discontinued precise-location tagging in tweets in June 2019, so geotagged tweets collected after that date contain only place-name locations. The randomly sampled tweets were downloaded from the Internet Archive [3]. This is a static dataset containing an approximately 1% random sample of the entire Twitter stream from January 2012 to June 2021. The data were processed and converted to the same CSV format as the geotagged tweets for ease of integration and analysis.
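The granularity levels described above can be illustrated with a minimal Python sketch. It assumes tweets are available as dictionaries in the Twitter API v1.1 JSON layout, where an exact location appears under "coordinates" and a place name under "place" with a "place_type" value. The mapping from place types to the levels named in the text, the function names, and the sample records are hypothetical; the column layout of the lab's CSV files is not documented here and may differ.

```python
from collections import Counter

# Assumed mapping from Twitter API v1.1 place_type values to the
# granularity levels named in the text (illustrative, not the lab's code).
PLACE_TYPE_TO_LEVEL = {
    "poi": "neighborhood/poi",
    "neighborhood": "neighborhood/poi",
    "city": "city/county",
    "admin": "state/province",
    "country": "country",
}


def geotag_level(tweet):
    """Return the granularity level of a tweet dict, or None if not geotagged."""
    # An exact latitude/longitude takes precedence over a place name.
    if tweet.get("coordinates"):
        return "exact coordinates"
    place = tweet.get("place") or {}
    return PLACE_TYPE_TO_LEVEL.get(place.get("place_type"))


def level_shares(tweets):
    """Return the percentage of geotagged tweets at each granularity level."""
    counts = Counter(
        level for level in map(geotag_level, tweets) if level is not None
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {level: 100.0 * n / total for level, n in counts.items()}


# Hypothetical sample records, reduced to the two fields used above.
sample_tweets = [
    {"coordinates": {"type": "Point", "coordinates": [-81.03, 34.00]},
     "place": {"place_type": "city"}},
    {"coordinates": None, "place": {"place_type": "city"}},
    {"coordinates": None, "place": {"place_type": "city"}},
    {"coordinates": None, "place": None},  # not geotagged, so ignored
]

shares = level_shares(sample_tweets)
```

Tweets without any location are excluded from the denominator, so the shares describe geotagged tweets only, as in the percentages reported above.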

These tweets can be requested in two formats: 1) individual tweet IDs filtered by designated keywords (e.g., COVID, HIV, Hurricane, Climate change), time period (year, month, day, hour), and geographic location (e.g., Columbia, SC; New York City; Japan); 2) spatially and/or temporally aggregated counts (e.g., the number of tweets in each county during a period; the daily number of tweets mentioning COVID in the US). In addition, we developed a Python-based tool that lets users download Twitter data on demand through the new Twitter API for Academic Research and convert the data to CSV files for further analysis. The Academic Research API allows you to search the full history of public Tweets using keywords and other filters. Users who want to access this tool need to apply and provide their own Twitter API credentials [18], because Twitter caps each account at 10 million tweets per month.
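As a rough illustration of the second request format, the following Python sketch counts tweets mentioning a keyword per day. The column names (tweet_id, created_at, text), the timestamp format, and the sample rows are assumptions made for this example; the actual columns of the CSV files are not listed on this page.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical CSV content: one row per tweet, with assumed column names.
SAMPLE_CSV = """tweet_id,created_at,text
1,2020-03-01 08:15:00,COVID cases are rising
2,2020-03-01 21:40:00,Nice weather today
3,2020-03-02 09:05:00,New covid testing site opened
4,2020-03-02 13:30:00,Stay safe during COVID
"""


def daily_keyword_counts(csv_file, keyword):
    """Count tweets per calendar day whose text contains the keyword."""
    needle = keyword.lower()
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Case-insensitive substring match on the tweet text.
        if needle in row["text"].lower():
            created = datetime.strptime(row["created_at"], "%Y-%m-%d %H:%M:%S")
            counts[created.date().isoformat()] += 1
    # Return the counts ordered by date.
    return dict(sorted(counts.items()))


counts = daily_keyword_counts(io.StringIO(SAMPLE_CSV), "covid")
```

With a real file, the in-memory sample would be replaced by an open file handle. Aggregating by county instead of by day would additionally require a location column and a lookup from place to county, neither of which is assumed here.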

Applications (sample published work): Extensive literature from different domains has reported the use of Twitter data to study natural hazards [4-7], public health [8-10], human dynamics [11-14], and climate change [15,16], to name a few. In addition, geotagged tweets provide unprecedented opportunities for location-based research topics that go beyond social studies. For example, geotagged tweets offer enormous opportunities for disaster management by examining the physical infrastructure (e.g., road damage) [17], the environment (e.g., flood extent) [6], and nature-human interaction (e.g., evacuation) [7] from spatial, temporal, and social dimensions.

References

1. Consuming streaming data, https://developer.twitter.com/en/docs/tutorials/consuming-streaming-data
2. Li, Z., Huang, X., Ye, X., Jiang, Y., Martin, Y., Ning, H., … & Li, X. (2021). Measuring global multi-scale place connectivity using geotagged social media data. Scientific Reports, 11(1), 1-19.
3. Internet Archive, https://archive.org/details/twitterstream
4. Jiang Y., Li Z., Cutter S., (2021) Social Distance Integrated Gravity Model for Evacuation Destination Choice, International Journal of Digital Earth, https://doi.org/10.1080/17538947.2021.1915396
5. Li Z., Huang Q., Emrich C., (2019) Introduction to Social Sensing and Big Data Computing for Disaster Management, International Journal of Digital Earth, 12(11), 1198-1204.
6. Li, Z., Wang, C., Emrich, C. T., & Guo, D. (2018). A novel approach to leveraging social media for rapid flood mapping: a case study of the 2015 South Carolina floods. Cartography and Geographic Information Science, 45(2), 97-110.
7. Martín, Y., Li, Z., & Cutter, S. L. (2017). Leveraging Twitter to gauge evacuation compliance: spatiotemporal analysis of Hurricane Matthew. PLoS one, 12(7), e0181701.
8. Li Z., Qiao S., Jiang Y., Li X., (2021) Building a Social media-based HIV Risk Behavior Index to Inform the Prediction of HIV New Diagnosis: A Feasibility Study, AIDS, https://doi.org/10.1097/qad.0000000000002787
9. Li Z., Li X., Porter D., Zhang J., Jiang Y., Olatosi B., Weissman S. (2020) Monitoring the Spatial Spread of COVID-19 and Effectiveness of Control Measures Through Human Movement Data: Proposal for a Predictive Model Using Big Data Analytics, JMIR Research Protocols, https://doi.org/10.2196/24432
10. Paul, M., & Dredze, M. (2011). You are what you tweet: Analyzing twitter for public health. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 5, No. 1, pp. 265-272).
11. Li, Z., Huang, X., Ye, X., Jiang, Y., Martin, Y., Ning, H., … & Li, X. (2021). Measuring global multi-scale place connectivity using geotagged social media data. Scientific Reports, 11(1), 1-19.
12. Huang X., Li Z., Jiang Y., Li X., Porter D. (2020) Twitter reveals human mobility dynamics during the COVID-19 pandemic, PloS One, https://doi.org/10.1371/journal.pone.0241957
13. Hu L., Li Z., Ye X., (2020) Delineating and Modelling Activity Space Using Geotagged Social Media Data, Cartography and Geographic Information Science, https://doi.org/10.1080/15230406.2019.1705187
14. Hu, F., Li, Z., Yang, C., & Jiang, Y. (2019). A graph-based approach to detecting tourist movement patterns using social media data. Cartography and Geographic Information Science, 46(4), 368-382.
15. Dahal, B., Kumar, S. A., & Li, Z. (2019). Topic modeling and sentiment analysis of global climate change tweets. Social network analysis and mining, 9(1), 1-20.
16. Cody, E. M., Reagan, A. J., Mitchell, L., Dodds, P. S., & Danforth, C. M. (2015). Climate change sentiment on Twitter: An unsolicited public opinion poll. PloS one, 10(8), e0136092.
17. Yuan, F., & Liu, R. (2020). Mining social media data for rapid damage assessment during Hurricane Matthew: Feasibility study. Journal of Computing in Civil Engineering, 34(3), 05020001.
18. How to get access to the Twitter API, https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api

 

Mode of Access

Remote online access

Level of Access

USC Researchers