Research Data Repositories 

The collection listed below contains datasets managed by the Big Data Health Science Center (BDHSC) that are of interest to the BDHSC research community. If you have questions, or datasets you would like to share with the data science community, please email BDHSC@mailbox.sc.edu.

 

  1. SC COVID-19 Cohort (S3C) Data
  2. SC Pregnancy Cohort Data
  3. Cell-phone Based Place Visitation Data
  4. Twitter Data

How to Get Access to Research Data Repositories

Submit a Research Data Request

Request a specific dataset for research purposes.

BDHSC Research Datasets

SC COVID-19 Cohort (S3C) Data

Data Coordinators

Bankole Olatosi & Jiajia Zhang

Data Description

The COVID-19 data consist of COVID-19 test data, healthcare visits from the all-payer system, healthcare visits from the Department of Mental Health (DMH), vaccine data, PEBA data, and Medical University of South Carolina (MUSC) data.

The COVID-19 test data report demographics (age, gender, race, ethnicity), residence (county), disease status (positive or negative), specimen collection date, and DHEC report date. For individuals who tested positive, the data also include information derived from the SC statewide Case Report Form (CRF): healthcare worker status, pregnancy status, COVID-19 symptom onset date, severity (symptoms, hospitalization and date, ICU admission, respiratory support, death and date), and pre-existing conditions.

The all-payer system provides demographics (age group, race, sex, ZIP code), healthcare admission date, admission type, admission reason, physician specialty type, diagnosis code and procedure code (ICD-10 code), charges, insurance (primary payor), discharge date and status, and health facility information (hospital ID, bed size, teaching status).

The healthcare visits from DMH include patients’ demographics (age group, sex, race, ethnicity, county, marital status), admission date, diagnosis code (ICD-10 code), disposition code, referral code, reason for transfer, place of service, insurance (payor), family income, income incident, household type, total charges, discharge date, and facility information (county of service, geographic location of facility).

The vaccine dataset collects demographics (age, gender, state, ZIP code), vaccine date, and vaccine type.

The PEBA datasets consist of an eligibility dataset (data for State Health Plan and related HMO clients, including age, sex, marital status, and number of months eligible per calendar year), a claims file (age, admission date, facility place of service, ICD-10 diagnosis and procedure codes, CPT procedure code, charges, length of stay, discharge date, discharge status), dental specialist data (age, service date, provider specialty, procedure code, charges, amount paid), and a pharmacy file (age, dispense date, drug code, drug name, therapeutic class, quantity, days of supply, charges, amount paid).

MUSC datasets include demographics (sex, race, ethnicity), residence (state, ZIP code), diagnosis code (ICD-10 code), procedure code (ICD-10 and CPT code), visit information (admission date, admission source, encounter type, primary payor, discharge date, discharge type), laboratory information (order date, specimen collection date, specimen source, LOINC code, results), medication prescribing information (prescribing order/start/end date, drug code (RxNorm code), dosage, frequency), medication administration information (administration date, drug code (RxNorm code), dosage, route), and vital measurements (weight, height, systolic and diastolic blood pressure).

Mode of Access

Remote online access

SC Pregnancy Cohort Data

Data Coordinators

Peiyin Hung & Jihong Liu

Data Description

1. Medicaid claims data (all care setting insurance claims data for Medicaid beneficiaries)

Medicaid data are a set of person- or episode-level data files derived from Medicaid Statistical Information System data on Medicaid eligibility, service utilization, and payments. The data are available for all states and the District of Columbia beginning with calendar year 1999, and for selected states prior to 1999. These data are developed to support research and policy analysis initiatives for Medicaid and other low-income populations. To derive pregnancy cohorts from Medicaid claims data, researchers identify the prenatal, labor and delivery, and/or postpartum care utilization relevant to the study focus using medical codes such as ICD, CPT, DRG, and HCPCS codes.
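As a rough illustration of code-based cohort derivation, the Python sketch below flags delivery-related claims by ICD-10-CM diagnosis prefix and keeps each member's earliest matching service date. The claim layout (`member_id`, `service_date`, `dx_codes`) is hypothetical, and the two prefixes (`O80`, `Z37`) are examples only; an actual study would use a validated code list covering ICD, CPT, DRG, and HCPCS codes as appropriate.

```python
from typing import Iterable

# Illustrative ICD-10-CM prefixes only; this is NOT a validated delivery definition.
DELIVERY_DX_PREFIXES = ("O80", "Z37")


def normalize_code(code: str) -> str:
    """Strip the decimal point and whitespace so 'Z37.0' and 'Z370' compare equal."""
    return code.replace(".", "").strip().upper()


def first_delivery_dates(claims: Iterable[dict]) -> dict:
    """Return {member_id: earliest ISO service date with a delivery-related diagnosis}."""
    first = {}
    for claim in claims:
        codes = [normalize_code(c) for c in claim["dx_codes"]]
        if any(code.startswith(DELIVERY_DX_PREFIXES) for code in codes):
            member, day = claim["member_id"], claim["service_date"]
            # ISO dates (YYYY-MM-DD) sort correctly as strings.
            if member not in first or day < first[member]:
                first[member] = day
    return first


# Hypothetical claims, made up for this example.
claims = [
    {"member_id": "A", "service_date": "2019-05-02", "dx_codes": ["Z37.0"]},
    {"member_id": "A", "service_date": "2019-05-01", "dx_codes": ["O80"]},
    {"member_id": "B", "service_date": "2019-06-10", "dx_codes": ["J06.9"]},
]
cohort = first_delivery_dates(claims)
print(cohort)
```

Only member A enters the cohort here, because member B has no delivery-related diagnosis in the sample claims.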

2. UB file (hospital-based uniform billing data)

https://rfa.sc.gov/boards-committees/dataoversight

Encounter-level data files contain individual patient-level data using encounter-level data elements; release of these files requires an application and a signed Data Use Agreement. However, the RFA has permission to release aggregate customized reports based on encounter-level data without a signed agreement. The RFA applies the following considerations in creating encounter-level data files.

Dates: All data elements that are date fields will be considered restricted data, because date fields provide unique information that, when linked with other databases, may identify an individual. On encounter-level files, age will be reported in five-year age groupings, with separate categories for “under one” (children under one year of age), “one to four” (children one to four years of age), and “85 and over” (ages over 84). Length of stay will be provided rather than admission and discharge dates. Month and day of week will be provided in lieu of admission date and/or discharge date.

Procedure Coding: Depending on the instructions in the Uniform Billing Manual (UB-04), procedures will be coded with ICD-9-CM procedure codes, HCPCS procedure codes, and/or CPT-4 procedure codes. When HCPCS and/or CPT-4 procedure codes are used, the units of service will be required according to the UB-04 coding manual.

Added variables: The variables admitting diagnosis, patient reason for visit, admission hour, and discharge hour will be added beginning October 1, 2007. Modifications to the E-Codes will become effective October 1, 2007. The present-on-admission code variable for all diagnoses will be added beginning January 1, 2008. The NPI variable will be added beginning May 23, 2007.
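The date and age rules for encounter-level files can be sketched in Python as follows. This is only an illustration of the stated rules, not RFA's actual processing: the function names, category labels (for example "5-9"), and output field names are assumptions.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def age_group(age_years: float) -> str:
    """Map an age in years to a reporting category (labels are assumed)."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years < 1:
        return "under 1"
    if age_years < 5:
        return "1-4"
    if age_years >= 85:
        return "85+"
    low = int(age_years) // 5 * 5  # start of the five-year band
    return f"{low}-{low + 4}"


def restrict_dates(admit: date, discharge: date) -> dict:
    """Replace exact dates with length of stay, month, and day of week."""
    return {
        "length_of_stay_days": (discharge - admit).days,
        "admit_month": admit.month,
        "admit_weekday": WEEKDAYS[admit.weekday()],
    }


print(age_group(0.5), age_group(3), age_group(37), age_group(90))
print(restrict_dates(date(2007, 10, 1), date(2007, 10, 4)))
```

Deriving length of stay before dropping the dates keeps the analytically useful quantity while removing the exact dates that could be linked to other databases.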

3. Vital statistics data (birth and death certificates)

In the United States, State laws require birth certificates to be completed for all births, and Federal law mandates national collection and publication of births and other vital statistics data. The National Vital Statistics System, the Federal compilation of this data, is the result of the cooperation between the National Center for Health Statistics (NCHS) and the States to provide access to statistical information from birth certificates. Natality Data from the National Vital Statistics System of the National Center for Health Statistics provides demographic and health data for births occurring during the calendar year. The microdata is based on information abstracted from birth certificates filed in vital statistics offices of each State and District of Columbia. Other available birth data are Birth Cohort Linked Birth/Infant Death Data, Period Linked Birth/Infant Death Data from the Perinatal Mortality Data, and Matched Multiple Birth Data. Demographic data include variables such as date of birth, age and educational attainment of parents, marital status, live-birth order, race, sex, and geographic area. Health data include items such as birth weight, gestation, prenatal care, attendant at birth, and Apgar score. Geographic data includes state, county, city (available for cities of 250,000+ (up to 1980) and 100,000+ (1980-)), SMSA (1980-), and metropolitan and nonmetropolitan counties.

4. Pregnancy Risk Assessment Monitoring System (PRAMS) data

The Pregnancy Risk Assessment Monitoring System (PRAMS) is a surveillance project of the Centers for Disease Control and Prevention (CDC) and health departments. Developed in 1987, PRAMS collects site-specific, population-based data on maternal attitudes and experiences before, during, and shortly after pregnancy. PRAMS surveillance currently covers about 81% of all U.S. births. The variables provided fall into five categories: 1) Birth Certificate Variables: selected variables from the birth certificate file; information on maternal and infant demographics comes primarily from this source. 2) Operational Variables: variables from the data collection process (e.g., whether the questionnaire was answered by mail or phone). 3) Weighting Variables: variables that account for the PRAMS survey design and the statistical weighting of the data; they are needed to analyze PRAMS data with complex sample software. 4) Questionnaire Variables: information collected from the PRAMS survey, including language, pre-pregnancy and during-pregnancy lifestyles, prenatal care utilization, substance use, physical and mental health status, and postpartum lifestyle. 5) Analytic Variables: precalculated variables that combine different variables in the data set, often those that are restricted (e.g., body mass index [BMI], created by combining variables on maternal weight and height).
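To show why the weighting variables matter, the Python sketch below computes a weighted prevalence point estimate from a few made-up survey records. The field names (`weight`, `smoked`) are hypothetical and are not actual PRAMS variable names. The sketch covers the point estimate only; standard errors and confidence intervals require complex sample software that accounts for the survey design.

```python
def weighted_prevalence(records, indicator: str, weight: str) -> float:
    """Weighted share of respondents for whom `indicator` is true."""
    total = sum(r[weight] for r in records)
    if total <= 0:
        raise ValueError("total weight must be positive")
    return sum(r[weight] for r in records if r[indicator]) / total


# Hypothetical respondents; each weight is the number of births represented.
records = [
    {"weight": 100.0, "smoked": True},
    {"weight": 300.0, "smoked": False},
    {"weight": 100.0, "smoked": True},
]

unweighted = sum(r["smoked"] for r in records) / len(records)
weighted = weighted_prevalence(records, "smoked", "weight")
print(f"unweighted={unweighted:.3f} weighted={weighted:.3f}")
```

In this toy sample the unweighted share is two of three respondents, while the weighted estimate is 0.4, because the respondent who answered "no" represents more births.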

5. All-Payer Claims Databases (APCD)

All-payer claims databases (APCDs) are large State databases that include medical claims, pharmacy claims, dental claims, and eligibility and provider files collected from private and public payers. APCD data are reported directly by insurers to States, usually as part of a State mandate. In terms of their capacity to produce price, resource use, and quality information for consumers, APCD data have three potential advantages over other datasets: 1) They include information on private insurance that many other datasets do not. 2) They include data from most or all insurance companies operating in any particular State, in contrast to some proprietary datasets. 3) They include information on care for patients across care sites, rather than just hospitalizations and emergency department visits reported as part of discharge data systems maintained by most States through State governments or hospital associations. They also include large sample sizes, geographic representation, and capture of longitudinal information on a wide range of individual patients. To date, 18 States have legislation mandating the creation and use of APCDs or are actively establishing APCDs, and more than 30 States maintain, are developing, or have a strong interest in developing an APCD. These efforts vary, with a handful of States, such as New Hampshire, Maine, and Massachusetts, using APCD data to launch public Web sites with price and cost information for consumers while other States are not as far along in the process.

6. Healthcare Cost and Utilization Project National Inpatient Sample

The National (Nationwide) Inpatient Sample (NIS) is the largest publicly available all-payer hospital inpatient care database in the United States. Researchers and policymakers use NIS data to identify, track, and analyze trends in health care utilization, access, charges, quality, and outcomes.

7. Healthcare Cost and Utilization Project State Inpatient Databases

The State Inpatient Databases (SID) are a set of hospital databases containing the universe of the inpatient discharge abstracts from participating States, translated into a uniform format to facilitate multi-State comparisons and analyses. Researchers and policymakers use the SID to investigate questions and identify trends unique to one State, to compare data from two or more States, and to conduct market area research or small area variation analyses.

8. State Ambulatory Surgery and Services Databases

The State Ambulatory Surgery and Services Databases (SASD) include encounter-level data for ambulatory surgery and other outpatient services from hospital-owned facilities. In addition, some States provide data for ambulatory surgery and outpatient services from nonhospital-owned facilities.

Mode of Access

Remote online access

Cell-phone Based Place Visitation Data

Data Coordinator

Zhenlong Li

Data Description

The cellphone-based place visitation data were obtained from SafeGraph and processed by the Geoinformation and Big Data Research Lab at the Center for GIScience and Geospatial Big Data (CeGIS), in collaboration with BDHSC, for academic research purposes. The data contain the monthly and weekly visitation flows originating from over 230,000 Census Block Groups (CBGs) to over 5 million Points of Interest (POIs) in the US from 01/01/2018 to 08/30/2022. These visitation flows are called “Origin-Destination-Time (ODT)” flows because each flow is a visitation record (number of visitors) from a CBG (Origin) to a POI (Destination) during a specific period (Time). In total, the dataset has 9.5 billion ODT flows and can be requested in two formats: 1) individual ODT flows filtered by time (year, month, week) and geographic location (for example, the ODT flows visiting a specific park or restaurant during a given week); and 2) a spatially and/or temporally aggregated format (for example, the weekly number of visitors to a hospital, to examine that hospital’s visitation trend).
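The relationship between individual ODT flows and the aggregated format can be sketched in Python. The record layout, CBG codes, and POI identifiers below are hypothetical and do not reflect the dataset's actual schema.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class ODTFlow(NamedTuple):
    origin_cbg: str       # Origin: Census Block Group code
    destination_poi: str  # Destination: Point of Interest identifier
    week_start: str       # Time: ISO date of the week's first day
    visitors: int         # number of visitors in this flow


def weekly_visitors(flows: Iterable[ODTFlow], poi_id: str) -> dict:
    """Sum visitors over all origins, per week, for one destination POI."""
    totals = defaultdict(int)
    for flow in flows:
        if flow.destination_poi == poi_id:
            totals[flow.week_start] += flow.visitors
    return dict(sorted(totals.items()))


# Made-up flows for illustration.
flows = [
    ODTFlow("450790001001", "poi_hospital_1", "2020-03-02", 12),
    ODTFlow("450790002002", "poi_hospital_1", "2020-03-02", 5),
    ODTFlow("450790001001", "poi_park_7", "2020-03-02", 30),
    ODTFlow("450790001001", "poi_hospital_1", "2020-03-09", 9),
]
trend = weekly_visitors(flows, "poi_hospital_1")
print(trend)
```

Filtering on `destination_poi` corresponds to the first request format; summing across origins per week yields the aggregated visitation trend of the second.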

Level of Access

Public (within USC)

Mode of Access

Remote online access

Twitter Data

Data Coordinator

Zhenlong Li

Data Description

The Twitter data were collected by the Geoinformation and Big Data Research Lab at the Center for GIScience and Geospatial Big Data (CeGIS) for academic research purposes.  This is a live dataset that contains worldwide tweets covering over 10 years from 2012 to present (real-time tweets are being collected around the clock). The total number of tweets as of December 2022 is around 18.6 billion. There are two types of Twitter data in the database: geotagged tweets and randomly sampled tweets.  The geotagged tweets are continuously collected using the official Twitter Streaming API with geo filters. The randomly sampled tweets were downloaded from the Internet Archive. All tweets have been cleaned and converted to CSV files with each row for a single tweet. These tweets can be requested in two formats: 1) individual tweets filtered with designated keywords (e.g., COVID, HIV, Hurricane, Climate change), time period (year, month, day, hour), and geographic location (e.g., Columbia, SC; New York City; Japan); and 2) spatially and/or temporally aggregated format (e.g., number of tweets in each county during a period; daily number of tweets mentioning COVID-19 in the US).
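Because the tweets are stored as CSV files with one row per tweet, both request formats amount to filtering rows and then counting them. The Python sketch below assumes hypothetical column names (`tweet_id`, `created_at`, `county`, `text`) and uses made-up sample rows; the actual file layout may differ.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; the column names are assumptions, not the real schema.
SAMPLE_CSV = """tweet_id,created_at,county,text
1,2020-03-15T14:02:00,Richland,COVID cases rising in Columbia
2,2020-03-15T18:30:00,Richland,Great weather today
3,2020-03-16T09:10:00,Charleston,Staying home because of covid
4,2020-03-16T11:45:00,Richland,covid testing site opens
"""


def filter_tweets(rows, keyword: str, start: str, end: str) -> list:
    """Keep tweets mentioning `keyword` with a date in [start, end] (ISO dates)."""
    needle = keyword.lower()
    return [
        row for row in rows
        if needle in row["text"].lower()
        and start <= row["created_at"][:10] <= end
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
matches = filter_tweets(rows, "covid", "2020-03-15", "2020-03-16")

# Temporal and spatial aggregation of the filtered tweets.
per_day = Counter(row["created_at"][:10] for row in matches)
per_county = Counter(row["county"] for row in matches)
print(dict(per_day), dict(per_county))
```

For files at the scale described, the same filter would be applied while streaming rows rather than loading them all into memory.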

 

Level of Access

Public (within USC)

Mode of Access

Remote online access

Request Information

If you have any further questions about the BDHSC Datasets, you can reach out to us by filling out the following form, and we will respond promptly. Please provide a detailed description to ensure that we have the information needed to assist you.
