Weekly Update Pipeline for Relational Database Design Using CDC Data

This data engineering project ingests data from the Centers for Disease Control and Prevention's (CDC) weekly updated, state-level disease table. My team and I created a relational database that stores the disease data, Census population data, and disease symptom data.

Overview of the Relational Database

As shown above, data is processed and stored in eight separate tables. Among them, `disease`, `census`, and `symptoms` were fetched from static HTML pages and RESTful API calls. Only `disease` is updated weekly, on Sundays, with the run scheduled using Prefect.

Tech Stack

Tools and Platforms: Google Cloud Platform (Cloud Run Functions, BigQuery, GCS Data Buckets), Prefect, Streamlit, Census API, LLM prompting

Languages: Python, SQL

Phase One: ETL of CDC data from static HTML pages, requesting each page by varying the year, week, and disease code. The function is scheduled using Prefect.
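
The request loop in Phase One might be sketched as below. This is a minimal illustration, not the project's actual code: the base URL, the URL pattern, and the disease code values are placeholders, since the exact CDC page layout is not spelled out in this README.

```python
from itertools import product

# Hypothetical URL pattern: the real CDC page layout may differ.
URL_TEMPLATE = "https://example.org/cdc/{year}/{week:02d}/table-{code}.html"


def build_table_urls(years, weeks, disease_codes):
    """Return one URL per (year, week, disease code) combination."""
    return [
        URL_TEMPLATE.format(year=year, week=week, code=code)
        for year, week, code in product(years, weeks, disease_codes)
    ]


if __name__ == "__main__":
    # Placeholder codes "a" and "b" stand in for real disease table codes.
    for url in build_table_urls([2024], range(1, 3), ["a", "b"]):
        print(url)
```

A scheduled weekly run would likely call such a helper with only the latest year and week, while a one-time backfill could pass full ranges.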

Phase Two: Added functions that trigger machine learning model training and store the results in the ML data tables. Additionally, Census Decennial and ACS data, including race, income, and population at the state level, was pulled in using RESTful API calls running on Cloud Run.
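
A state-level Census API request could look roughly like the sketch below. It assumes the public `api.census.gov/data` endpoint and its list-of-lists JSON response with a header row first; the helper names are hypothetical, and the dataset path and variable code in the usage note are examples to check against the Census variable listings rather than the ones this project used.

```python
from urllib.parse import urlencode

CENSUS_BASE = "https://api.census.gov/data"


def build_census_url(year, dataset, variables, api_key=None):
    """Build a query URL that requests the given variables for every state."""
    params = {"get": ",".join(["NAME", *variables]), "for": "state:*"}
    if api_key:
        params["key"] = api_key
    # Keep commas, colons, and asterisks unescaped so the URL stays readable.
    return f"{CENSUS_BASE}/{year}/{dataset}?{urlencode(params, safe=',:*')}"


def rows_to_records(payload):
    """Convert a list-of-lists response (header row first) into dicts."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]
```

For example, `build_census_url(2022, "acs/acs5", ["B01003_001E"])` would target the ACS 5-year total population estimate, and the parsed JSON body can be passed to `rows_to_records` before loading into BigQuery.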

Phase Three: Added downstream tools: Looker Studio dashboards and a text-to-SQL app built with Streamlit that prompts the Gemini LLM.
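
The prompting side of a text-to-SQL app can be sketched without any model call. The example below is an assumption-heavy illustration: only the table names come from this project, while the column lists, the prompt wording, and both helper functions are hypothetical. The guard is a crude string check, not a SQL parser.

```python
# Hypothetical column lists: only the table names come from the project.
SCHEMA = {
    "disease": ["state", "year", "week", "disease_code", "cases"],
    "census": ["state", "population"],
    "symptoms": ["disease_code", "symptom"],
}


def build_prompt(question, schema=SCHEMA):
    """Compose a prompt asking the model for a single read-only SQL query."""
    tables = "\n".join(
        f"- {table}({', '.join(columns)})" for table, columns in schema.items()
    )
    return (
        "You translate questions into BigQuery SQL.\n"
        f"Tables:\n{tables}\n"
        "Return one SELECT statement and nothing else.\n"
        f"Question: {question}"
    )


def is_read_only(sql):
    """Accept only a single statement that starts with SELECT."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        # More than one statement (or a semicolon in a literal): reject.
        return False
    return statement.upper().startswith("SELECT")
```

The string from `build_prompt` would be sent to the LLM, and the returned SQL checked with `is_read_only` before it is run against the database.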

Text-to-SQL App Demonstration

See more in the GitHub repos:

Team: https://github.com/Olivia-Peng/BA882-pipeline_G10

Streamlit: https://github.com/eshentong/streamlit-cdc