
ETL Batch Processing Project - No-Show Appointment

Batch ETL pipeline analyzing hospital appointment no-show trends across age, SMS prompts, and welfare enrollment. GCP • PySpark • BigQuery • Looker Studio

Stack: GCS, PySpark, BigQuery, Looker Studio
Data: Hospital no-show appointments
Goal: Build a batch ETL pipeline and reporting dashboard

Project Description

This project investigates no-show trends in hospital appointments: is attendance a function of age, ailment type, or proximity to the hospital? SMS prompts may also play a vital role in helping people honor their appointments.
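As an illustration of how such a question can be answered once the data is in Spark, the sketch below computes the no-show rate by SMS reminder status. It is a hypothetical snippet, not part of the pipeline: `df` stands for a Spark DataFrame with the columns listed in the Dataset Description section below.

from pyspark.sql import functions as F

# Hypothetical sketch: no-show rate by SMS reminder status.
# `df` is assumed to be a Spark DataFrame with the dataset's columns.
no_show_by_sms = (
    df.groupBy("SMS_received")
      .agg(
          F.count("*").alias("appointments"),
          # "No-show" == "Yes" means the patient missed the appointment
          F.avg(F.when(F.col("No-show") == "Yes", 1).otherwise(0)).alias("no_show_rate"),
      )
)
no_show_by_sms.show()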

Questions for Analysis

Objective

Technology Stack

The following technologies are used to build this project: Google Cloud Storage (data lake), PySpark (transformation and schema design), BigQuery (data warehouse), and Looker Studio (reporting dashboard).

Data Pipeline Architecture

(Diagram: data pipeline architecture)

Reproduce it yourself

  1. Clone this repo to your local machine:

git clone https://github.com/Khunmi/Capstone-Project---Data-Engineering-ZoomCamp

  2. Set up your Google Cloud environment:

export GOOGLE_APPLICATION_CREDENTIALS=<path_to_your_credentials>.json
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
gcloud auth application-default login

Check out this link for a video walkthrough.
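To confirm the credentials are picked up before moving on, a quick sanity check with the google-cloud-storage Python client can list a few objects in your bucket (the bucket name below is a placeholder, not this project's actual value):

# Hypothetical sanity check: confirm GOOGLE_APPLICATION_CREDENTIALS works
# by listing a few objects in your GCS bucket.
from google.cloud import storage

client = storage.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS by default
bucket_name = "your-data-lake-bucket"  # placeholder: use your own bucket
for blob in client.list_blobs(bucket_name, max_results=5):
    print(blob.name)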

  3. Set up the dependencies needed to run Apache Spark and interact with it through PySpark on your local machine. This is required because PySpark handles the transformation and schema design for this project. See here for an instructional guide.
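A minimal smoke test, assuming Spark and PySpark are installed locally, verifies the setup before running the real job:

# Minimal local smoke test for the PySpark installation.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")  # run Spark locally on all available cores
    .appName("pyspark-smoke-test")
    .getOrCreate()
)
print(spark.version)
spark.range(5).show()  # tiny DataFrame proves the session works
spark.stop()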

  4. Create a pipeline that loads the Parquet files from an API into your data lake, just as seen here.
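A minimal sketch of that load step, assuming a hypothetical source URL and bucket name (neither is this project's actual value), could look like:

# Hypothetical sketch: fetch a Parquet file over HTTP and land it in GCS.
import requests
from google.cloud import storage

SOURCE_URL = "https://example.com/no_show_appointments.parquet"  # placeholder
BUCKET = "your-data-lake-bucket"                                 # placeholder

resp = requests.get(SOURCE_URL, timeout=60)
resp.raise_for_status()

client = storage.Client()
blob = client.bucket(BUCKET).blob("capstone/no_show_appointments.parquet")
blob.upload_from_string(resp.content, content_type="application/octet-stream")
print(f"Uploaded {len(resp.content)} bytes to gs://{BUCKET}/{blob.name}")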

  5. Create a Compute Engine instance on Google Cloud so you can run a Dataproc job in a serverless environment, which also gives the flexibility of scheduling. See here for a video guide.

  6. Upload the Python script containing your Spark job to GCS and submit it from the command line using the code below. Make sure you edit every part that references my Google Cloud configuration to match yours.

gcloud dataproc jobs submit pyspark \
    --cluster=de-zoomcamp-cluster \
    --region=northamerica-northeast2 \
    --project=khunmi-academy-376002 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    gs://dtc_data_lake_khunmi-academy-376002/code/capstone_proj.py \
    -- \
        --input_data=gs://dtc_data_lake_khunmi-academy-376002/capstone/ \
        --output=trips_data_all.capstone2023
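For orientation, a job script like capstone_proj.py would follow roughly this shape: read Parquet from the data lake and write the result to BigQuery via the spark-bigquery connector. This is a hedged sketch, not the actual script; only the argument names (--input_data, --output) come from the command above, and the staging bucket is a placeholder.

import argparse

from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--input_data", required=True)  # GCS path to Parquet files
parser.add_argument("--output", required=True)      # BigQuery table: dataset.table
args = parser.parse_args()

spark = SparkSession.builder.appName("capstone-no-show").getOrCreate()
# The BigQuery connector stages data through GCS; bucket name is a placeholder.
spark.conf.set("temporaryGcsBucket", "your-temp-bucket")

df = spark.read.parquet(args.input_data)
# ... transformations / schema enforcement would go here ...

df.write.format("bigquery").option("table", args.output).mode("overwrite").save()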
  7. Looker Studio (formerly Google Data Studio): to generate insights from the dataset, I connected the data source located in my data warehouse (BigQuery). You can interact with the dashboard below.

Live Dashboard

Dataset Description

This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. Each row includes a number of characteristics about the patient. ‘ScheduledDay’ tells us on what day the patient set up their appointment. ‘Neighbourhood’ indicates the location of the hospital. ‘Scholarship’ indicates whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família:

| #  | Variable       | Data Type |
|----|----------------|-----------|
| 1  | PatientId      | float64   |
| 2  | AppointmentID  | int64     |
| 3  | Gender         | object    |
| 4  | ScheduledDay   | object    |
| 5  | AppointmentDay | object    |
| 6  | Age            | int64     |
| 7  | Neighbourhood  | object    |
| 8  | Scholarship    | int64     |
| 9  | Hipertension   | int64     |
| 10 | Diabetes       | int64     |
| 11 | Alcoholism     | int64     |
| 12 | Handicap       | int64     |
| 13 | SMS_received   | int64     |
| 14 | No-show        | object    |
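For the schema design step, the table above maps onto an explicit PySpark StructType along these lines (a sketch; the nullability choices are assumptions):

# Sketch of an explicit PySpark schema matching the table above.
# Nullability (True) is an assumption.
from pyspark.sql.types import (
    DoubleType, LongType, StringType, StructField, StructType,
)

schema = StructType([
    StructField("PatientId",      DoubleType(), True),  # float64
    StructField("AppointmentID",  LongType(),   True),  # int64
    StructField("Gender",         StringType(), True),
    StructField("ScheduledDay",   StringType(), True),  # parsed to timestamp later
    StructField("AppointmentDay", StringType(), True),
    StructField("Age",            LongType(),   True),
    StructField("Neighbourhood",  StringType(), True),
    StructField("Scholarship",    LongType(),   True),
    StructField("Hipertension",   LongType(),   True),
    StructField("Diabetes",       LongType(),   True),
    StructField("Alcoholism",     LongType(),   True),
    StructField("Handicap",       LongType(),   True),
    StructField("SMS_received",   LongType(),   True),
    StructField("No-show",        StringType(), True),
])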

Future work