GHCNd stands for Global Historical Climatology Network daily. It is a global observation dataset publicly released by NOAA, with records dating back to 1763 and daily meteorological observations from roughly 120,000 weather stations worldwide. Its monthly counterpart is the Global Historical Climatology Network monthly (GHCNm).

During the weekend, I performed some statistical analysis on this dataset, and I’d like to record the general processing workflow.

First, I synchronized the data. The dataset is hosted on NOAA's own servers as well as in the AWS and GCP public dataset programs. For simplicity, I chose the copy on AWS. I initially used the AWS CLI for synchronization, but JuiceFS works as well.

# Sync via AWS CLI
aws s3 cp --region us-west-2 --no-sign-request s3://noaa-ghcn-pds/ghcnd-stations.txt data/ghcnd-stations.txt
aws s3 cp --recursive --region us-west-2 --no-sign-request s3://noaa-ghcn-pds/parquet/by_station data/parquet/by-station

# Sync via JuiceFS
juicefs sync s3://noaa-ghcn-pds.s3.amazonaws.com/ghcnd-stations.txt data/ghcnd-stations.txt
juicefs sync s3://noaa-ghcn-pds.s3.amazonaws.com/parquet/by_station data/parquet/by-station

Next, I computed daily- and monthly-level statistics.

The core approach involves using Polars to calculate the required statistical information within specified time ranges. For example, the daily statistics calculation:

# Calculate daily statistics
# df is a LazyFrame of one station's observations, sorted by DATE
# (group_by_dynamic requires sorted input); element is the GHCNd
# element code being aggregated, e.g. "TMAX".
daily_df = (
    df.group_by_dynamic("DATE", every="1d")
    .agg(
        pl.col("DATA_VALUE").mean().alias(f"{element}_MEAN"),
        pl.col("DATA_VALUE").min().alias(f"{element}_MIN"),
        pl.col("DATA_VALUE").max().alias(f"{element}_MAX"),
        pl.col("DATA_VALUE").quantile(0.1).alias(f"{element}_P10"),
        pl.col("DATA_VALUE").quantile(0.2).alias(f"{element}_P20"),
        pl.col("DATA_VALUE").quantile(0.5).alias(f"{element}_P50"),
        pl.col("DATA_VALUE").quantile(0.8).alias(f"{element}_P80"),
        pl.col("DATA_VALUE").quantile(0.9).alias(f"{element}_P90"),
        pl.col("DATA_VALUE").sum().alias(f"{element}_SUM"),
        pl.col("DATA_VALUE").count().alias(f"{element}_COUNT"),
    )
    .collect()
)

Then I used Matplotlib for visualization:

[Interactive charts: daily and monthly data for Beijing, Shanghai, and Tokyo]
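A minimal sketch of the plotting step, assuming a daily-mean temperature series like the Polars output above; the station name, column label, and data values here are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted rendering
import matplotlib.pyplot as plt
from datetime import date

# Toy daily-mean series standing in for the aggregated Polars output;
# values are invented, in degrees C.
dates = [date(2020, 1, d) for d in range(1, 6)]
tmean = [-2.1, -1.5, 0.3, 1.0, -0.4]

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(dates, tmean, label="TAVG_MEAN")
ax.set_xlabel("Date")
ax.set_ylabel("Temperature (°C)")
ax.set_title("Beijing daily mean temperature (illustrative)")
ax.legend()
fig.autofmt_xdate()
fig.savefig("beijing_daily.png", dpi=150)
```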

The processing code is open-sourced on GitHub at ringsaturn/ghcn-showcases. Additionally, a static page displaying statistics for selected stations is available on GitHub Pages at ghcn-showcases.

Web Page Preview:

Beijing Display Preview

Shanghai Display Preview

Temperature Comparison of Beijing/Shanghai/Tokyo

Precipitation Comparison of Beijing/Shanghai/Tokyo