The aim of this work is to analyse what is being taught, to whom, and how in statistics training courses (usually CPD or advanced training for research).
Why this might be of general relevance for statistics education:
Training courses reveal, and help close, the skills/knowledge gap between teaching and the workplace
They reflect evolving or emerging methods and tools
They are a principal means of teacher-training/upskilling
Databases
In total, we analysed 16,571 training course descriptions:
ALLSTAT mailing list: 6,843 posts from 1998 to 2025
NCRM training courses and events database: 8,924 entries from 2003 to 2025
Elixir TeSS training course database: 804 entries from 2011 to 2025
All of these descriptions were webscraped from the three websites using Python.
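A minimal sketch of the scraping step, assuming a hypothetical listing page in which each course sits in a `<div class="course">` with an `<h2>` title and a `<p>` description. The real ALLSTAT, NCRM and TeSS pages are structured differently, and the sample HTML is invented for illustration. Only Python's standard-library `html.parser` is used, so no network access is needed to run it.

```python
from html.parser import HTMLParser

# Invented sample page; the real sites use different markup.
SAMPLE_HTML = """
<div class="course"><h2>Intro to Bayesian Methods</h2><p>Two-day online course using R.</p></div>
<div class="course"><h2>Multilevel Modelling</h2><p>In-person workshop, Stata.</p></div>
"""


class CourseParser(HTMLParser):
    """Collect the title and description text of each course block."""

    def __init__(self):
        super().__init__()
        self.courses = []    # one dict per course block found
        self._field = None   # field that the current text belongs to

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "div" and ("class", "course") in attrs:
            self.courses.append({"title": "", "description": ""})
        elif tag == "h2" and self.courses:
            self._field = "title"
        elif tag == "p" and self.courses:
            self._field = "description"

    def handle_endtag(self, tag):
        if tag in ("h2", "p"):
            self._field = None

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign
        if self._field and self.courses:
            self.courses[-1][self._field] += data


def parse_courses(html):
    """Return a list of {"title", "description"} dicts with trimmed text."""
    parser = CourseParser()
    parser.feed(html)
    return [{k: v.strip() for k, v in c.items()} for c in parser.courses]


courses = parse_courses(SAMPLE_HTML)
```

In a real run, the HTML string would come from a downloaded page rather than an inline sample, and the tag and class names would need to match each site.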
Extracted information
For each training course description, we extracted
One-line description of the course
One-term summary of the topic
Topic keywords
Intended audience (e.g. academic/research field) and level
Software
Duration, delivery method
Course provider
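The extracted fields can be thought of as one structured record per course description. The sketch below is an assumption about how such a record might be laid out in Python; the class name, field names and example values are hypothetical and are not taken from the study's code.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class CourseRecord:
    """One extracted record; all field names are illustrative."""
    one_line_description: str
    topic_term: str                               # one-term topic summary
    keywords: list = field(default_factory=list)
    audience: str = ""                            # e.g. academic/research field
    level: str = ""
    software: list = field(default_factory=list)
    duration: str = ""
    delivery_method: str = ""                     # e.g. online, in person
    provider: str = ""


# Invented example values, for illustration only.
example = CourseRecord(
    one_line_description="Two-day introduction to multilevel models.",
    topic_term="multilevel modelling",
    keywords=["mixed effects", "hierarchical data"],
    audience="social science researchers",
    level="intermediate",
    software=["R"],
    duration="2 days",
    delivery_method="online",
    provider="Example University",
)

# Plain dict form, convenient for writing to CSV or JSON
record_dict = asdict(example)
```

Keeping keywords and software as lists, rather than free text, makes later counting and grouping across thousands of records simpler.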
Extraction method
Information was extracted using a locally running large language model (LLM), Llama 3.3.
The LLM ran on a Linux workstation with an NVIDIA RTX A6000 GPU (10,752 CUDA cores, 48 GB of VRAM).
The script was written in R using the ellmer package.
Approximately 2,500 descriptions were processed per day.
LLM API services provided by OpenAI could easily have been used instead (a one-line change in the R code), at an estimated cost of approximately £50.
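The study's script was written in R with ellmer, so the following Python sketch is not the authors' code. It only illustrates the general pattern under stated assumptions: a prompt asks the model for a JSON object with a fixed set of keys, and the reply is parsed and checked. `call_model` is a hypothetical stand-in that returns a canned reply; in practice it would wrap whichever local or hosted LLM client is in use. The key names and prompt wording are also assumptions.

```python
import json

# Hypothetical key names mirroring the extracted fields listed earlier.
EXPECTED_KEYS = [
    "one_line_description", "topic_term", "keywords", "audience",
    "level", "software", "duration", "delivery_method", "provider",
]


def build_prompt(description):
    """Ask for a JSON object containing exactly the expected keys."""
    return (
        "Extract the following fields from the training course description "
        "and reply with a single JSON object with keys: "
        + ", ".join(EXPECTED_KEYS)
        + ". Use null where the description gives no information.\n\n"
        + "Description:\n" + description
    )


def call_model(prompt):
    """Stand-in for a real LLM call; always returns a canned JSON reply."""
    return json.dumps({
        "one_line_description": "Two-day online course on Bayesian methods.",
        "topic_term": "Bayesian statistics",
        "keywords": ["Bayesian inference", "MCMC"],
        "audience": None,
        "level": "introductory",
        "software": ["R"],
        "duration": "2 days",
        "delivery_method": "online",
        "provider": None,
    })


def extract(description, model=call_model):
    """Run one description through the model and validate the reply."""
    reply = json.loads(model(build_prompt(description)))
    missing = [k for k in EXPECTED_KEYS if k not in reply]
    if missing:
        raise ValueError(f"model reply is missing keys: {missing}")
    # Keep only the expected keys, in a fixed order
    return {k: reply[k] for k in EXPECTED_KEYS}


result = extract("Intro to Bayesian Methods. Two-day online course using R.")
```

Because the model is passed in as a function, switching between a local model and a hosted API changes only that one argument, which is in the spirit of the one-line change described above.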