Safety Incident & Near-Miss Pattern Mining — Methanex Hackathon

TF-IDF RAG/Ollama Cloud/DSPy + GEPA Prompt Optimization/Dash/Plotly/Cloud Run/Full-Stack Dashboard

v2 (Apr 2026): Migrated from Vertex AI to Ollama Cloud + DSPy/GEPA

Author: Regan Yin  |  Team: Bubble Team — Reg Lei, Jeffrey Sun, Cayden Li, Regan Yin, Jiale Guan
Event: UBC MBAn Hackathon 2026  |  Client: Methanex Corporation
A full-stack analytics dashboard and LLM-powered AI Safety Analyst that turns 5 years of unstructured incident records into actionable prevention intelligence — now fully reproducible at zero cost on the free Ollama Cloud tier with a DSPy + GEPA-tuned prompt and graceful corpus-only fallback.


Executive Summary

Dataset: 203 incidents · 1,659 actions
Timespan: 2019 – 2024
NLP Clusters: 7 risk scenarios
LLM Cascade: 4 free models
Cost to Reproduce: $0
Cold-start: < 5 s (TF-IDF in-memory)

Methanex collects vast amounts of safety records, but they are reviewed case-by-case, making recurring patterns hard to spot. We built an end-to-end pipeline that clusters events via NLP, quantifies risk with an Early Warning Index, and deploys a Dash-based executive dashboard alongside a generative-AI advisor. v2 ships a complete re-architecture: TF-IDF retrieval + an Ollama Cloud LLM cascade with a DSPy + GEPA-tuned system instruction, all packaged into a Dockerized Cloud Run image that scales to zero.

Tech Stack

Core (v2)

Python 3.11 · Dash · Plotly · Pandas / NumPy · scikit-learn (TF-IDF) · Ollama Cloud · DSPy 2.6+ · GEPA 0.0.4+ · Gunicorn · Docker · Google Cloud Run

LLM Cascade

gpt-oss:20b-cloud → gemini-3-flash-preview:cloud → gpt-oss:120b-cloud → qwen3-coder:480b-cloud

Legacy v1 (archived under legacy_rag_engine/)

Vertex AI · Gemini 2.5 Flash · Vector Search (Matching Engine) · text-embedding-004 · Cloud Storage · LangChain

Problem Statement

Safety records contain rich lessons, but reviewing them case-by-case makes recurring scenarios hard to spot and slows frontline guidance. Our challenge was to bridge the gap between raw, localized safety narratives and systemic business intelligence:

  1. Identify patterns and clusters of similar events (e.g., AI system failures, LOTO gaps, HR privacy exposures).
  2. Understand the factors driving higher severity (actual incidents) versus high-potential warnings (near-misses).
  3. Provide data-driven recommendations on where to focus prevention efforts and training.
  4. Let an investigator paste a hypothetical "what happened" snippet and immediately get a grounded, structured AI risk assessment with comparable historical cases.

v2 Architecture (Apr 2026)

The end-to-end pipeline is composed of six production-ready layers, all running locally in a single Dockerized Dash app:

1. Dash UI & Plotly visuals: app.py renders the executive dashboard, KPI tiles, cluster explorer, Early Warning module, and the AI Safety Analyst tab.
2. Retrieval (TF-IDF): safety_analyst.retrieve_similar_events() performs cosine similarity over an in-memory TF-IDF matrix (1–2 grams, sublinear TF, 20k features) built once at import time over the 2019–2024 events corpus. Top-k = 10.
3. Prompt assembly: The retrieved cases are formatted into a compact historical_context block and fused with a strict-JSON schema hint plus the GEPA-tuned system instruction loaded from dspy_gepa_best_config.json at import.
4. Cascading Ollama Cloud LLM call: gpt-oss:20b-cloud → gemini-3-flash-preview:cloud → gpt-oss:120b-cloud → qwen3-coder:480b-cloud. Each model is consulted only if the previous one timed out, returned empty content, produced unparseable JSON, or was missing too many schema keys.
5. JSON sanitization & gap-fill: The 6-key payload is normalized: labels coerced to canonical vocabularies, unicode bullets stripped, suggested_actions renumbered, best_practices rendered as a clean list. If 1–2 sections are missing, they are merged from the deterministic corpus fallback rather than discarding the LLM output.
6. Markdown rendering: Final response is composed for the Dash typewriter effect, annotated with the model used and the average TF-IDF similarity of the retrieved cases.

Reliability: If every cascade model fails, the deterministic corpus-only fallback composes a complete, well-formatted report directly from the top-5 similar cases (modal labels, deduped root causes, recorded actions, and lessons). The UI always renders something useful, annotated with a notice that the LLM was unavailable and the response was generated deterministically from the historical corpus.
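The corpus-only fallback can be pictured with a short sketch. This is an illustration under stated assumptions, not the code in safety_analyst.py: the function name compose_fallback and the record keys (root_causes, actions, lessons) are hypothetical and chosen only to mirror the six-key payload.

```python
from collections import Counter


def _modal(values):
    """Return the most common non-empty value, or "Unknown" if none exist."""
    counts = Counter(v for v in values if v)
    return counts.most_common(1)[0][0] if counts else "Unknown"


def _dedupe(values):
    """Drop empty strings and case-insensitive duplicates, keeping first-seen order."""
    seen, out = set(), []
    for v in values:
        key = v.strip().lower()
        if key and key not in seen:
            seen.add(key)
            out.append(v.strip())
    return out


def compose_fallback(similar_cases, top_n=5):
    """Build a six-key report from the top-N retrieved cases without any LLM call."""
    cases = similar_cases[:top_n]
    actions = _dedupe([c.get("actions", "") for c in cases])
    return {
        "risk_level": _modal(c.get("risk_level") for c in cases),
        "severity": _modal(c.get("severity") for c in cases),
        "category_type": _modal(c.get("category_type") for c in cases),
        "root_cause": " ".join(_dedupe([c.get("root_causes", "") for c in cases])),
        "suggested_actions": [f"{i}. {a}" for i, a in enumerate(actions, 1)],
        "best_practices": _dedupe([c.get("lessons", "") for c in cases]),
    }
```

Because every field is derived by counting and de-duplicating retrieved rows, the same input always yields the same report, which is what makes this path safe to use when no model responds.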

v1 vs v2 — Why We Migrated

The original Vertex AI / Gemini stack delivered great results during the hackathon but had a hard reproducibility cost: it required a billed GCP project, two long-running endpoints, and a service account key. v2 keeps the same public API (analyze_new_event(text, events_df, k)) so the Dash app needed only a one-line import swap, but the underlying engine is rebuilt around free, open-weight tooling.

Concern | v1: Vertex AI / Gemini | v2: Ollama Cloud + DSPy/GEPA
Retrieval | MatchingEngineIndexEndpoint.find_neighbors() | Local TF-IDF on events_clean.csv
Embeddings | text-embedding-004 (Vertex, paid) | None (TF-IDF, $0)
LLM | gemini-2.5-flash via langchain_google_vertexai | Free-tier Ollama Cloud cascade
Prompt | Hand-written | DSPy + GEPA reflective optimization
Cold-start | 30–45 min one-time GCP index build | < 5 s in-memory TF-IDF build
Cloud deploy | Vertex endpoints (always-on) | Cloud Run + Docker (scale-to-zero)
Cost to reproduce | GCP project + billing + endpoint hosting | Zero (free Ollama key only)
Failure mode | Hard 5xx if GCP quota / billing fails | Cascading retry → deterministic corpus fallback

Anyone can now clone the repo, paste a free Ollama Cloud API key into .env, and run the dashboard locally without GCP, billing accounts, or service-account keys.

Step-by-Step Methodology

Step 1 — Data Ingestion & Preprocessing

Raw, messy text narratives and structured fields were cleaned and standardized. Key temporal and categorical variables (year, category_type, risk_level, severity) were extracted to build a solid analytical foundation.

Step 2 — NLP Pattern Mining & Clustering

Incident narratives (title + setting + what-happened + root-causes) were embedded and clustered to group events into 7 distinct, actionable "Risk Scenario" clusters. The mapping was saved as case_cluster_map.csv and is now consumed directly by the dashboard's filters.

Step 3 — Severity Driver Analysis & Early Warning Index

Each cluster is mapped against the ratio of Incidents (realized harm) to Near-Misses (free lessons), and a composite Early Warning Index is computed:

EWI = (Near-miss rate) × (High-priority share within near-misses) × log(1 + n_cases)

This prioritizes clusters where near-misses are frequent, serious, and occur at meaningful scale.
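A minimal sketch of the index follows. It assumes the logarithm is the natural log and that a cluster with no near-misses scores zero; the function name and the per-cluster counts are hypothetical and are not taken from the dataset.

```python
import math


def early_warning_index(n_cases, n_near_miss, n_high_priority_near_miss):
    """EWI = near-miss rate x high-priority share within near-misses x log(1 + n_cases)."""
    if n_cases <= 0 or n_near_miss <= 0:
        return 0.0  # assumption: no near-misses means no early-warning signal
    near_miss_rate = n_near_miss / n_cases
    hp_share = n_high_priority_near_miss / n_near_miss
    return near_miss_rate * hp_share * math.log1p(n_cases)


# Hypothetical counts: (total cases, near-misses, high-priority near-misses).
clusters = {
    "A": (40, 30, 12),
    "B": (10, 9, 8),
}
ranked = sorted(clusters, key=lambda c: early_warning_index(*clusters[c]), reverse=True)
```

In this made-up example the smaller cluster B ranks first: its near-misses are both more frequent and more often high priority, and the log term only partly rewards cluster A for its larger size.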

Step 4 — Generative AI Safety Analyst

A free-text input is TF-IDF retrieved against the corpus, fused into a strict-JSON RAG prompt, and sent through the Ollama Cloud cascade. The system instruction was tuned by DSPy + GEPA on a stratified eval set; missing schema sections are auto-filled from the deterministic corpus fallback.
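The prompt-assembly step can be sketched as below. The helper name build_prompt, the what_happened key, and the exact layout of the prompt are assumptions made for illustration; the project's real template lives in safety_analyst.py and may differ.

```python
import json

SCHEMA_KEYS = ["risk_level", "severity", "category_type",
               "root_cause", "suggested_actions", "best_practices"]


def build_prompt(user_text, similar_cases, max_chars=300):
    """Fuse the user report, a compact case digest, and a JSON schema hint."""
    lines = []
    for i, case in enumerate(similar_cases, 1):
        # Collapse whitespace and truncate so ten cases stay within a small budget.
        summary = " ".join(str(case.get("what_happened", "")).split())[:max_chars]
        lines.append(
            f"[{i}] sim={case.get('similarity', 0.0):.2f} "
            f"risk={case.get('risk_level', '?')} "
            f"severity={case.get('severity', '?')} :: {summary}"
        )
    schema_hint = json.dumps({k: "..." for k in SCHEMA_KEYS})
    return (
        "historical_context:\n" + "\n".join(lines) + "\n\n"
        "incident_input:\n" + user_text.strip() + "\n\n"
        "Respond with one JSON object using exactly these keys:\n" + schema_hint
    )
```

Keeping each retrieved case to a single truncated line is one way to leave room for the system instruction while still showing the model the labels it should stay consistent with.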

Step 5 — Cloud Run Deployment

A slim Python 3.11 Dockerfile binds gunicorn to $PORT with 1 worker × 8 threads and a 120s timeout. .gcloudignore excludes the legacy folder, dev artifacts, and the eval-cases dump so production images stay lean.
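The same server settings can also be written as a gunicorn configuration file, which is plain Python. The file name gunicorn.conf.py and the 8080 local default are assumptions; the project may pass these values as command-line flags in the Dockerfile instead.

```python
# gunicorn.conf.py (hypothetical file; equivalent CLI flags would work as well)
import os

# Cloud Run injects the listening port through $PORT; 8080 is only a local default.
bind = f"0.0.0.0:{os.environ.get('PORT', '8080')}"
workers = 1    # one process, so only one copy of the in-memory TF-IDF matrix exists
threads = 8    # concurrent requests are served by threads inside that process
timeout = 120  # seconds of silence before gunicorn kills and restarts the worker
```

A single worker with several threads suits this app because slow LLM calls are I/O-bound, while extra processes would each rebuild and hold their own TF-IDF matrix.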

NLP Clusters & Early Warning

We grouped 203 events into 7 clusters with searchable keywords:

# | Cluster | Scenario | Primary Prevention Lever
0 | AI Monitoring / Decision-Support Errors | AI alarms/recommendations mislead operations | Human-in-the-loop verification + drift monitoring
1 | Stored Energy / LOTO Gaps | Hydraulic/pneumatic stored energy during maintenance | Multi-energy LOTO checklist + "try-actuate" verification
2 | Office Electrical / Ergonomics | WFH/office safety: power bars, cords, trips, strains | Basic office safety standards + housekeeping checklist
3 | Line Work / Piping Containment | Pipes/valves not fully depressurized → release | Standard line-break procedure + pressure confirmation
4 | Field Safety: Height / Confined Space | Elevated work or confined space, often contractor tasks | Permit-to-work discipline + contractor supervision
5 | Cyber-Physical Control Disruption | Unauthorized access affects controls → process deviation | Access hardening + segmentation + two-person rule
6 | HR / Privacy Exposure | Sensitive HR info exposed (overheard calls, visible screens) | Privacy-by-default + secure sharing rules

Priority Scoring

Each event combines a risk-level score and a severity score:

Risk Level Encoding

  • High = 2
  • Medium = 1
  • Low = 0

Severity Encoding

  • Serious / Major = 3
  • Potentially Significant = 2
  • Minor / Near Miss = 1

Priority = Risk + Severity (Low: 0–2, Medium: 2–4, High: ≥4). The Early Warning Index then aggregates across clusters to rank the most urgent areas of focus.
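The scoring can be sketched as follows. The encodings come from the lists above. The banding is partly an assumption: the listed bands share the boundary values 2 and 4, so this sketch treats a score of exactly 2 as Medium and, following "≥4", a score of exactly 4 as High. The function names are hypothetical.

```python
RISK_SCORE = {"High": 2, "Medium": 1, "Low": 0}
SEVERITY_SCORE = {
    "Serious": 3, "Major": 3,
    "Potentially Significant": 2,
    "Minor": 1, "Near Miss": 1,
}


def priority_score(risk_level, severity):
    """Priority = risk-level score + severity score."""
    return RISK_SCORE[risk_level] + SEVERITY_SCORE[severity]


def priority_band(score):
    """Assumed tie-break: exactly 2 is Medium, exactly 4 is High."""
    if score >= 4:
        return "High"
    if score >= 2:
        return "Medium"
    return "Low"
```

With these encodings the score ranges from 1 (Low risk, Minor) to 5 (High risk, Serious or Major).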

AI Safety Analyst (RAG + Cascading LLM)

The AI tab lets frontline users describe a situation in plain language and receive instant, evidence-based analysis. The pipeline is a TF-IDF RAG followed by a fault-tolerant LLM cascade.

Free-tier Ollama Cloud cascade

gpt-oss:20b-cloud | Primary | clean 6-key JSON, deterministic, lowest latency | ~5 s
gemini-3-flash-preview:cloud | Fallback #1 | usually clean; occasional truncation | ~9 s
gpt-oss:120b-cloud | Fallback #2 | big reasoning, content + thinking both populated | ~9 s
qwen3-coder:480b-cloud | Fallback #3 | very high quality, slow | ~80 s

Each model is consulted only if the previous one raised an exception (including a timeout), returned empty content, or produced unparseable / heavily-incomplete JSON. The lineup is overridable via OLLAMA_MODEL and OLLAMA_FALLBACK_MODELS.
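One way the two override variables could be resolved into an ordered model list is sketched below. The comma-separated format for OLLAMA_FALLBACK_MODELS and the function name resolve_cascade are assumptions; check .env.example for the format the project actually expects.

```python
import os

DEFAULT_PRIMARY = "gpt-oss:20b-cloud"
DEFAULT_FALLBACKS = [
    "gemini-3-flash-preview:cloud",
    "gpt-oss:120b-cloud",
    "qwen3-coder:480b-cloud",
]


def resolve_cascade(env=None):
    """Return the ordered model list, honouring the two override variables."""
    env = os.environ if env is None else env
    primary = env.get("OLLAMA_MODEL", "").strip() or DEFAULT_PRIMARY
    raw = env.get("OLLAMA_FALLBACK_MODELS", "").strip()
    # Assumption: fallbacks are supplied as a comma-separated list.
    fallbacks = [m.strip() for m in raw.split(",") if m.strip()] if raw else DEFAULT_FALLBACKS
    # Drop duplicates while preserving order so no model is called twice.
    return list(dict.fromkeys([primary, *fallbacks]))
```

For example, setting OLLAMA_MODEL=gpt-oss:120b-cloud alone would move the larger model to the front and drop its duplicate from the default fallback list.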

Strict JSON output contract

The model is constrained to emit exactly six keys, each normalized post-hoc by safety_analyst.py:

Key | Domain / Format
risk_level | One of Low, Medium, High
severity | One of Minor, Potentially Significant, Serious, Major, Near Miss
category_type | One of Incident, Near Miss, Other
root_cause | 2–4 sentence paragraph grounded in the retrieved cases
suggested_actions | 3–5 numbered actions, each "N. <sentence>. Owner: <role>. Timing: <Immediate / <30 days / 30-90 days / >90 days>. Verification: <text>."
best_practices | 2–3 short lessons drawn from the corpus

Reasoning-model quirk handled: gpt-oss:120b often returns message.content == "" when format="json" is forced and instead places the JSON object inside message.thinking. The engine reads both fields and picks the one that actually contains a JSON object, instead of failing and falling through.
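Label coercion for the three categorical keys might look like the sketch below. The vocabularies are taken from the contract above; the matching rule (ignore case, spacing, and punctuation) and the default values are assumptions, and the function names are hypothetical rather than those used in safety_analyst.py.

```python
VALID_RISK = ("Low", "Medium", "High")
VALID_SEVERITY = ("Minor", "Potentially Significant", "Serious", "Major", "Near Miss")
VALID_CATEGORY = ("Incident", "Near Miss", "Other")


def coerce_label(raw, vocabulary, default):
    """Map a free-form model label onto the canonical vocabulary."""
    key = "".join(ch for ch in str(raw).lower() if ch.isalnum())
    for label in vocabulary:
        if key == "".join(ch for ch in label.lower() if ch.isalnum()):
            return label
    return default  # assumption: unknown labels fall back to a fixed default


def normalize_labels(payload):
    """Return a copy of the payload with its three label fields canonicalized."""
    out = dict(payload)
    out["risk_level"] = coerce_label(payload.get("risk_level"), VALID_RISK, "Medium")
    out["severity"] = coerce_label(payload.get("severity"), VALID_SEVERITY, "Minor")
    out["category_type"] = coerce_label(payload.get("category_type"), VALID_CATEGORY, "Other")
    return out
```

Comparing only alphanumeric characters lets variants such as "near-miss" or " HIGH " resolve to the canonical spelling without a hand-maintained synonym table.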

DSPy + GEPA Prompt Optimization

The system instruction shipped in dspy_gepa_best_config.json was produced by dspy_gepa_benchmark.py — a self-contained driver that tunes the prompt against the actual analyst pipeline (TF-IDF + Ollama). The flow:

  1. Build a stratified eval set from events_clean.csv: 5 input styles (keywords, title, snippet, mini-report, full report) × 3 categories.
  2. Sweep four hand-authored prompt styles (balanced, strict_format, evidence_grounded, operational) against the live analyzer pipeline and pick the best.
  3. Wrap a dspy.ChainOfThought module around the strict-JSON signature.
  4. Run dspy.GEPA with a high-temperature reflection LM and a structured metric that scores label exactness, action structure, root-cause grounding, length, and best-practices quality.
  5. Persist the winning system instruction. The next python app.py picks it up automatically.

Composite metric (used both as the GEPA objective and for native scoring)

score = 0.22·risk + 0.20·severity + 0.16·category + 0.16·root_cause + 0.18·actions + 0.08·best_practices

  • Label scores are exact-match with ordinal partial credit (severity tiers earn credit for adjacent buckets).
  • Root cause is jointly scored on length [80–600 chars] and token grounding against the gold root_causes + lessons columns.
  • Action structure rewards numbered formatting plus the presence of Owner / Timing / Verification markers.
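Two of these components can be illustrated with a small sketch. It is not the project's _label_score or _action_score: the 0.5 credit for an adjacent tier, the equal weighting of the three markers, and both function names are assumptions.

```python
import re

MARKERS = ("Owner:", "Timing:", "Verification:")


def ordinal_label_score(pred, gold, order, adjacent_credit=0.5):
    """1.0 for an exact match, partial credit for a neighbouring tier, else 0.0."""
    if pred == gold:
        return 1.0
    if pred in order and gold in order and abs(order.index(pred) - order.index(gold)) == 1:
        return adjacent_credit
    return 0.0


def action_structure_score(text):
    """Average, over numbered actions, of the share of required markers present."""
    # Split before each "N. " that starts a line, so every chunk is one action.
    chunks = [c.strip() for c in re.split(r"(?m)^(?=\d+\.\s)", text) if c.strip()]
    numbered = [c for c in chunks if re.match(r"\d+\.\s", c)]
    if not numbered:
        return 0.0  # unnumbered output earns no structure credit
    per_action = [sum(m in c for m in MARKERS) / len(MARKERS) for c in numbered]
    return sum(per_action) / len(per_action)
```

The six weights in the composite formula sum to 1.0, so a payload that scores 1.0 on every component receives a final score of 1.0.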

The "winner" — operational style

After the sweep + GEPA reflection, the operational prompt style won. dspy_gepa_best_config.json contains the resulting instruction with explicit decision policy, output contract, formatting rules, and anti-pattern guards (e.g., no unicode bullets, no marketing filler, root cause must identify the underlying factor not restate the incident).

Re-tuning

# Quickest sanity check (~5 min on free-tier Ollama Cloud)
python dspy_gepa_benchmark.py --num-cases 24 --auto light

# Better quality (~20-30 min)
python dspy_gepa_benchmark.py --num-cases 36 --auto medium

# Just dump the eval cases without running GEPA
python dspy_gepa_benchmark.py --prepare-only

Dashboard Features

The Dash application (app.py, ~860 lines) delivers a production-quality executive dashboard with three main tabs:

Tab 1 — Performance Dashboard

  • Animated KPI tiles (total cases, incident volume, near-miss conversion, high-risk %, severe cases) with client-side JavaScript counter animations.
  • Overview sub-tab: bar charts by category type and risk level; line chart over time; stacked year breakdown; top primary classifications; filterable data table with CSV export.
  • Clusters sub-tab: case-count ranking, per-cluster drilldown with donut chart, AI-extracted top terms & example incidents, ESG performance metrics, root-cause & corrective-action themes.
  • Early Warning Dashboard: ranked EWI bar chart, counts heatmap (incidents / near-misses / high-priority), rates heatmap (near-miss rate, HP within NM, HP within Inc).

Tab 2 — Event Intelligence

  • Sortable, filterable data portal with native search and CSV export for all historical events.

Tab 3 — AI Safety Analyst

  • Free-text input for describing a hypothetical hazard scenario.
  • Cascading LLM analysis with typewriter animation and source-event transparency.
  • Expandable top-10 similar events table with individual CSV export.
  • Response is annotated with the model used and the average TF-IDF similarity of the retrieved cases.

Global Filters

  • Year range slider (2019–2024)
  • Cluster dropdown (multi-select)
  • High-risk-only toggle

Selected Code Highlights

TF-IDF retrieval (replaces Vertex Vector Search)

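import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# EVENTS_DF (the cleaned events DataFrame) and _row_to_doc (row -> text)
# are defined elsewhere in the module and are not shown in this excerpt.

def _build_tfidf(df):
    """Fit a TF-IDF vectorizer over one text document per event row."""
    corpus = [_row_to_doc(r) for _, r in df.iterrows()]
    vec = TfidfVectorizer(
        stop_words="english",
        ngram_range=(1, 2),      # unigrams and bigrams
        max_df=0.95,             # drop terms present in >95% of events
        min_df=1,
        sublinear_tf=True,       # use 1 + log(tf) instead of raw counts
        max_features=20000,
    )
    return vec, vec.fit_transform(corpus)

# Built once at import time and reused for every query.
VECTORIZER, EVENT_MATRIX = _build_tfidf(EVENTS_DF)

def retrieve_similar_events(query, k=10):
    """Return the k events most similar to the query, best match first."""
    qvec = VECTORIZER.transform([query])
    sims = cosine_similarity(qvec, EVENT_MATRIX).ravel()
    idx = np.argsort(sims)[::-1][:k]   # indices sorted by descending similarity
    out = EVENTS_DF.iloc[idx].copy()
    out["similarity"] = np.round(sims[idx], 4)
    return out.reset_index(drop=True)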

Cascading Ollama Cloud call with thinking-channel fallback

from ollama import Client

def _ollama_chat(prompt, system, model=None):
    """Reasoning models (gpt-oss:120b) often emit empty `content` when
    format='json' is forced and put the JSON inside `thinking`. Read both
    and prefer the channel that actually contains a JSON object."""
    client = Client(host=OLLAMA_API_BASE,
                    headers={"Authorization": f"Bearer {OLLAMA_API_KEY}"},
                    timeout=LLM_TIMEOUT_S)
    response = client.chat(
        model=model or OLLAMA_MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user",   "content": prompt}],
        format="json",
        options={"temperature": 0.2, "num_predict": 2500},
    )
    content  = (response.message.content  or "").strip()
    thinking = (response.message.thinking or "").strip()
    # Empty content: the JSON, if any, can only be in the thinking channel.
    if not content and thinking:
        return thinking
    # Content has no JSON object but thinking does: use thinking.
    if content and "{" not in content and "{" in thinking:
        return thinking
    return content

Cascading retry + corpus gap-fill

candidate_models = [OLLAMA_MODEL, *OLLAMA_FALLBACK_MODELS]
payload, used_model, merged_fields = None, None, []

for model in candidate_models:
    try:
        normalized, _ = _try_llm(prompt, model)
        missing = [f for f in _REQUIRED_LLM_FIELDS if not normalized.get(f)]
        if len(missing) > MAX_MISSING_FIELDS_BEFORE_FULL_FALLBACK:
            raise RuntimeError(f"missing too many sections: {missing}")
        if missing:  # 1-2 sections missing: merge them from the corpus fallback
            fb = fallback_response(top_k)
            for fld in missing:
                normalized[fld] = fb[fld]
            merged_fields = missing
        payload, used_model = normalized, model
        break
    except Exception as exc:
        log.warning("%s failed: %s", model, exc)  # fall through to the next model

if payload is None:  # every model failed
    payload = fallback_response(top_k)  # deterministic corpus-only report

DSPy signature for GEPA optimization

class AnalyzeSafetyEvent(dspy.Signature):
    """Methanex EPSSC safety analyst: predict labels and write a structured
    report grounded in retrieved historical cases. Strict-JSON output."""

    incident_input:     str = dspy.InputField(desc="User report (keywords / snippet / full).")
    historical_context: str = dspy.InputField(desc="TF-IDF retrieved similar cases.")
    style_guide:        str = dspy.InputField(desc="Stylistic guidance for the response.")

    risk_level:        str = dspy.OutputField(desc='"Low" | "Medium" | "High".')
    severity:          str = dspy.OutputField(desc='"Minor" | "Potentially Significant" | "Serious" | "Major" | "Near Miss".')
    category_type:     str = dspy.OutputField(desc='"Incident" | "Near Miss" | "Other".')
    root_cause:        str = dspy.OutputField(desc="2-4 sentences grounded in historical context.")
    suggested_actions: str = dspy.OutputField(desc="3-5 numbered actions w/ Owner / Timing / Verification.")
    best_practices:    str = dspy.OutputField(desc="2-3 short bullet lessons from the corpus.")

Composite GEPA metric with structured feedback

def metric_from_payload(case, payload):
    risk = _label_score(payload["risk_level"], case.gold_risk, VALID_RISK)
    sev  = _label_score(payload["severity"],   case.gold_severity, VALID_SEVERITY)
    cat  = _label_score(payload["category_type"], case.gold_category, VALID_CATEGORY)
    rc   = 0.5*_length_score(payload["root_cause"], 80, 600) \
         + 0.5*_grounding_score(payload["root_cause"],
                                case.gold_root_cause, case.gold_lessons)
    act  = _action_score(payload["suggested_actions"])
    bp   = 0.6*_length_score(payload["best_practices"], 50, 480) \
         + 0.4*_grounding_score(payload["best_practices"], "", case.gold_lessons)

    final = 0.22*risk + 0.20*sev + 0.16*cat + 0.16*rc + 0.18*act + 0.08*bp
    feedback = (f"risk={risk:.2f}, sev={sev:.2f}, cat={cat:.2f}, "
                f"rc_ground={rc:.2f}, action_struct={act:.2f}, bp={bp:.2f}. "
                "Match risk/severity/category EXACTLY. Each action MUST contain "
                "Owner, Timing, AND Verification. Ground root cause in retrieved cases.")
    return final, feedback

Cloud Run Deployment

A slim Python 3.11 image binds gunicorn to $PORT with 1 worker × 8 threads and a 120s timeout. The Ollama key is wired in via Secret Manager so no credentials live in the image:

PROJECT_ID=your-gcp-project
REGION=us-west1

# 1. Build & push to Artifact Registry
gcloud builds submit \
  --tag $REGION-docker.pkg.dev/$PROJECT_ID/methanex/epssc-dashboard:latest

# 2. Stash the Ollama key in Secret Manager
#    (printf avoids the trailing newline that a here-string would store in the secret)
printf '%s' "<paste-your-key>" | gcloud secrets create OLLAMA_API_KEY --data-file=-

# 3. Deploy with the secret mounted as an env var
gcloud run deploy methanex-epssc \
  --image $REGION-docker.pkg.dev/$PROJECT_ID/methanex/epssc-dashboard:latest \
  --region $REGION \
  --allow-unauthenticated \
  --update-secrets OLLAMA_API_KEY=OLLAMA_API_KEY:latest \
  --memory 1Gi --cpu 1 --concurrency 40 --timeout 180

.gcloudignore excludes the entire legacy_rag_engine/ folder, local .env files, caches, virtualenvs, notebooks, and the dev-only dspy_gepa_eval_cases.json so production uploads stay lean.

Key Findings & Recommendations

  • Focus on dominant clusters: The Pareto module shows which operational clusters account for the highest volume of reports and highest combined Risk × Severity priority.
  • Target high-severity ratios: Clusters with a high Incident-to-Near-Miss conversion rate (e.g., AI Monitoring & Decision-Support Errors in 2024) are vulnerabilities where current defenses are frequently failing — those should be the top investment areas.
  • Proactive monitoring: The timeline module surfaces emerging risks (e.g., a 2024 spike in IT/AI-related exposures) before they become systemic hazards.
  • Grounded triage at the desk: The Generative AI Safety Analyst gives any investigator a structured, corpus-grounded triage of a free-text incident in seconds — including the top-10 most similar historical events for cross-reference.
  • Resilience built-in: The cascading LLM + deterministic corpus fallback means the dashboard is never silently broken — every response is annotated with its source.

How to Run Locally

Repository Structure

methanex-safety-intelligence/
│
├── app.py                          # Dash dashboard (UI, KPIs, AI Analyst tab)
├── safety_analyst.py               # NEW v2 — TF-IDF + Ollama Cloud RAG engine
├── dspy_gepa_benchmark.py          # NEW v2 — DSPy + GEPA prompt-optimization driver
├── dspy_gepa_best_config.json      # NEW v2 — persisted GEPA-tuned system instruction
├── dspy_gepa_eval_cases.json       # NEW v2 — stratified eval cases (reproducibility)
│
├── Dockerfile                      # NEW v2 — Cloud Run production image
├── .gcloudignore                   # NEW v2 — excludes legacy / dev files from deploy
├── .env.example                    # Updated — Ollama Cloud + optional legacy vars
├── requirements.txt                # Updated — Ollama, DSPy, GEPA, gunicorn
│
├── data/
│   ├── events_clean.csv            # Cleaned 2019-2024 event corpus
│   ├── actions_clean.csv           # Cleaned recommended actions
│   ├── case_cluster_map.csv        # NLP-generated cluster mappings
│   ├── case_priority_scores.csv
│   ├── cluster_profile_sorted.csv
│   ├── cluster_summary_with_terms_examples.csv
│   └── near_miss_early_warning_dashboard.csv
│
├── assets/
│   ├── style.css                   # Methanex corporate CSS
│   ├── logo.svg
│   └── favicon.ico
│
├── legacy_rag_engine/              # ARCHIVED — Vertex AI / Gemini v1 stack
│   ├── README.md                   # Why this folder exists + how to re-enable
│   ├── rag_engine.py               # Original Vertex Vector Search RAG
│   ├── setup_cloud.py              # One-time GCP index/endpoint bootstrap
│   ├── data_processing.py          # Historical KPI helper (no longer imported)
│   └── gcp-key.json                # Template service-account JSON
│
├── LICENSE
└── README.md

Quick start (no GCP required)

git clone https://github.com/Regan-Yin/Hackathon_Project_Incident_mining_Methanex.git
cd Hackathon_Project_Incident_mining_Methanex

python3.11 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Get a free Ollama Cloud key at https://ollama.com/settings/keys
cp .env.example .env
# Paste your key into OLLAMA_API_KEY=...

python app.py
# → http://127.0.0.1:8050/

Open the Generative AI Safety Analyst tab, paste any short snippet (e.g., "Operator slipped near unsealed valve during night shift"), and within ~5–10 s you'll see a predicted Risk / Severity / Category, a grounded Root Cause paragraph, 3–5 structured Suggested Actions, 2–3 corpus-derived Best Practices, and the Top 10 Similar Historical Events table.

If the LLM is unreachable, the same UI still renders — the deterministic corpus-only fallback fills every section and the response is annotated "LLM unavailable — response generated deterministically from the historical corpus."