Safety Incident & Near-Miss Pattern Mining — Methanex Hackathon
TF-IDF RAG/Ollama Cloud/DSPy + GEPA Prompt Optimization/Dash/Plotly/Cloud Run/Full-Stack Dashboard
v2 — Apr 2026: Migrated from Vertex AI to Ollama Cloud + DSPy/GEPA
Author: Regan Yin | Team: Bubble Team — Reg Lei, Jeffrey Sun, Cayden Li, Regan Yin, Jiale Guan
Event: UBC MBAn Hackathon 2026 | Client: Methanex Corporation
A full-stack analytics dashboard and LLM-powered AI Safety Analyst that turns 5 years of unstructured incident records into actionable prevention intelligence — now fully reproducible at zero cost on the free Ollama Cloud tier with a DSPy + GEPA-tuned prompt and graceful corpus-only fallback.
Contents
- Executive Summary
- Tech Stack
- Problem Statement
- v2 Architecture (Apr 2026)
- v1 vs v2 — Why We Migrated
- Step-by-Step Methodology
- NLP Clusters & Early Warning
- AI Safety Analyst (RAG + Cascading LLM)
- DSPy + GEPA Prompt Optimization
- Dashboard Features
- Selected Code Highlights
- Cloud Run Deployment
- Key Findings & Recommendations
- How to Run Locally
Executive Summary
Methanex collects vast amounts of safety records, but they are reviewed case-by-case, making recurring patterns hard to spot. We built an end-to-end pipeline that clusters events via NLP, quantifies risk with an Early Warning Index, and deploys a Dash-based executive dashboard alongside a generative-AI advisor. v2 ships a complete re-architecture: TF-IDF retrieval + an Ollama Cloud LLM cascade with a DSPy + GEPA-tuned system instruction, all packaged into a Dockerized Cloud Run image that scales to zero.
Tech Stack
Core (v2)
Python 3.11, Dash, Plotly, Pandas / NumPy, scikit-learn (TF-IDF), Ollama Cloud, DSPy 2.6+, GEPA 0.0.4+, Gunicorn, Docker, Google Cloud Run
LLM Cascade
gpt-oss:20b-cloud, gemini-3-flash-preview:cloud, gpt-oss:120b-cloud, qwen3-coder:480b-cloud
Legacy v1 (archived under legacy_rag_engine/)
Vertex AI, Gemini 2.5 Flash, Vector Search (Matching Engine), text-embedding-004, Cloud Storage, LangChain
Problem Statement
Safety records contain rich lessons, but reviewing them case-by-case makes recurring scenarios hard to spot and slows frontline guidance. Our challenge was to bridge the gap between raw, localized safety narratives and systemic business intelligence:
- Identify patterns and clusters of similar events (e.g., AI system failures, LOTO gaps, HR privacy exposures).
- Understand the factors driving higher severity (actual incidents) versus high-potential warnings (near-misses).
- Provide data-driven recommendations on where to focus prevention efforts and training.
- Let an investigator paste a hypothetical "what happened" snippet and immediately get a grounded, structured AI risk assessment with comparable historical cases.
v2 Architecture (Apr 2026)
The end-to-end pipeline is composed of the following production-ready layers, all running locally in a single Dockerized Dash app:
- app.py renders the executive dashboard, KPI tiles, cluster explorer, Early Warning module, and the AI Safety Analyst tab.
- safety_analyst.retrieve_similar_events() performs cosine similarity over an in-memory TF-IDF matrix (1–2 grams, sublinear TF, 20k features) built once at import time over the 2019–2024 events corpus. Top-k = 10.
- The retrieved cases are formatted into a historical_context block and fused with a strict-JSON schema hint plus the GEPA-tuned system instruction loaded from dspy_gepa_best_config.json at import.
- The prompt is sent through the cascade gpt-oss:20b-cloud → gemini-3-flash-preview:cloud → gpt-oss:120b-cloud → qwen3-coder:480b-cloud. Each model is consulted only if the previous one timed out, returned empty content, produced unparseable JSON, or was missing too many schema keys.
- The response is normalized: suggested_actions renumbered, best_practices rendered as a clean list. If 1–2 sections are missing, they are merged from the deterministic corpus fallback rather than discarding the LLM output.
v1 vs v2 — Why We Migrated
The original Vertex AI / Gemini stack delivered great results during the hackathon but had a hard reproducibility cost: it required a billed GCP project, two long-running endpoints, and a service account key. v2 keeps the same public API (analyze_new_event(text, events_df, k)) so the Dash app needed only a one-line import swap, but the underlying engine is rebuilt around free, open-weight tooling.
| Concern | v1 Vertex AI / Gemini | v2 Ollama Cloud + DSPy/GEPA |
|---|---|---|
| Retrieval | MatchingEngineIndexEndpoint.find_neighbors() | Local TF-IDF on events_clean.csv |
| Embeddings | text-embedding-004 (Vertex, paid) | None — TF-IDF, $0 |
| LLM | gemini-2.5-flash via langchain_google_vertexai | Free-tier Ollama Cloud cascade |
| Prompt | Hand-written | DSPy + GEPA reflective optimization |
| Cold-start | 30–45 min one-time GCP index build | < 5 s in-memory TF-IDF build |
| Cloud deploy | Vertex endpoints (always-on) | Cloud Run + Docker (scale-to-zero) |
| Cost to reproduce | GCP project + billing + endpoint hosting | Zero — free Ollama key only |
| Failure mode | Hard 5xx if GCP quota / billing fails | Cascading retry → deterministic corpus fallback |
Anyone can now clone the repo, paste a free Ollama Cloud API key into .env, and run the dashboard locally without GCP, billing accounts, or service-account keys.
Step-by-Step Methodology
Step 1 — Data Ingestion & Preprocessing
Raw, messy text narratives and structured fields were cleaned and standardized. Key temporal and categorical variables (year, category_type, risk_level, severity) were extracted to build a solid analytical foundation.
Step 2 — NLP Pattern Mining & Clustering
Incident narratives (title + setting + what-happened + root-causes) were embedded and clustered to group events into 7 distinct, actionable "Risk Scenario" clusters. The mapping was saved as case_cluster_map.csv and is now consumed directly by the dashboard's filters.
Step 3 — Severity Driver Analysis & Early Warning Index
Each cluster is mapped against the ratio of Incidents (realized harm) to Near-Misses (free lessons), and a composite Early Warning Index is computed:
EWI = (Near-miss rate) × (High-priority share within near-misses) × log(1 + n_cases)
This prioritizes clusters where near-misses are frequent, serious, and occur at meaningful scale.
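The index above can be sketched in a few lines. This is a minimal illustration, assuming the natural logarithm (the text only says "log"); the function name and the example counts are hypothetical and are not taken from the Methanex data.

```python
import math

def early_warning_index(n_cases, n_near_miss, n_high_priority_near_miss):
    """EWI = near-miss rate x high-priority share within near-misses x log(1 + n_cases)."""
    if n_cases == 0 or n_near_miss == 0:
        return 0.0  # no near-miss signal to rank on
    near_miss_rate = n_near_miss / n_cases
    hp_share = n_high_priority_near_miss / n_near_miss
    # ASSUMPTION: natural log; math.log1p(x) computes ln(1 + x).
    return near_miss_rate * hp_share * math.log1p(n_cases)

# Hypothetical counts: (total cases, near-misses, high-priority near-misses).
clusters = {
    "cluster_a": (40, 30, 12),
    "cluster_b": (10, 9, 6),
}
ranked = sorted(clusters, key=lambda c: early_warning_index(*clusters[c]), reverse=True)
```

The logarithm damps the effect of raw volume, so a small cluster with a high share of serious near-misses can still outrank a larger, milder one.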
Step 4 — Generative AI Safety Analyst
A free-text input is TF-IDF retrieved against the corpus, fused into a strict-JSON RAG prompt, and sent through the Ollama Cloud cascade. The system instruction was tuned by DSPy + GEPA on a stratified eval set; missing schema sections are auto-filled from the deterministic corpus fallback.
Step 5 — Cloud Run Deployment
A slim Python 3.11 Dockerfile binds gunicorn to $PORT with 1 worker × 8 threads and a 120s timeout. .gcloudignore excludes the legacy folder, dev artifacts, and the eval-cases dump so production images stay lean.
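As a rough illustration of the server settings in Step 5, the same values could be written as a gunicorn config file, which is plain Python. The file name gunicorn.conf.py and the 8080 default are assumptions here; the project's Dockerfile may pass these values as command-line flags instead.

```python
# gunicorn.conf.py (hypothetical file name, for illustration only)
import os

# Cloud Run injects the listening port through the PORT environment variable.
# ASSUMPTION: fall back to 8080 when PORT is not set (e.g. local runs).
bind = f"0.0.0.0:{os.environ.get('PORT', '8080')}"

workers = 1    # one process, so the TF-IDF matrix is built only once
threads = 8    # requests are served concurrently by threads in that process
timeout = 120  # seconds before an unresponsive worker is restarted
```

A single worker with several threads keeps memory use low while still letting slow LLM calls overlap with ordinary dashboard requests.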
NLP Clusters & Early Warning
We grouped 203 events into 7 clusters with searchable keywords:
| # | Cluster | Scenario | Primary Prevention Lever |
|---|---|---|---|
| 0 | AI Monitoring / Decision-Support Errors | AI alarms/recommendations mislead operations | Human-in-the-loop verification + drift monitoring |
| 1 | Stored Energy / LOTO Gaps | Hydraulic/pneumatic stored energy during maintenance | Multi-energy LOTO checklist + "try-actuate" verification |
| 2 | Office Electrical / Ergonomics | WFH/office safety: power bars, cords, trips, strains | Basic office safety standards + housekeeping checklist |
| 3 | Line Work / Piping Containment | Pipes/valves not fully depressurized → release | Standard line-break procedure + pressure confirmation |
| 4 | Field Safety: Height / Confined Space | Elevated work or confined space, often contractor tasks | Permit-to-work discipline + contractor supervision |
| 5 | Cyber-Physical Control Disruption | Unauthorized access affects controls → process deviation | Access hardening + segmentation + two-person rule |
| 6 | HR / Privacy Exposure | Sensitive HR info exposed (overheard calls, visible screens) | Privacy-by-default + secure sharing rules |
Priority Scoring
Each event combines a risk-level score and a severity score:
Risk Level Encoding
- High = 2
- Medium = 1
- Low = 0
Severity Encoding
- Serious / Major = 3
- Potentially Significant = 2
- Minor / Near Miss = 1
Priority = Risk + Severity (Low: 0–2, Medium: 2–4, High: ≥4). The Early Warning Index then aggregates across clusters to rank the most urgent areas of focus.
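A minimal sketch of the scoring rule, using the encodings listed above. The band edges as printed share the values 2 and 4, so this example assumes a score of 2 counts as Medium and a score of 4 counts as High; that reading, like the function names, is an assumption and not confirmed by the source.

```python
RISK_SCORE = {"High": 2, "Medium": 1, "Low": 0}
SEVERITY_SCORE = {
    "Serious": 3,
    "Major": 3,
    "Potentially Significant": 2,
    "Minor": 1,
    "Near Miss": 1,
}

def priority_score(risk_level, severity):
    """Priority = risk-level score + severity score."""
    return RISK_SCORE[risk_level] + SEVERITY_SCORE[severity]

def priority_band(score, medium_from=2, high_from=4):
    # ASSUMPTION: boundary scores belong to the higher band (2 -> Medium, 4 -> High).
    if score >= high_from:
        return "High"
    if score >= medium_from:
        return "Medium"
    return "Low"
```

For example, a High-risk event with Major severity scores 2 + 3 = 5 and lands in the High band.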
AI Safety Analyst (RAG + Cascading LLM)
The AI tab lets frontline users describe a situation in plain language and receive instant, evidence-based analysis. The pipeline is a TF-IDF RAG followed by a fault-tolerant LLM cascade.
Free-tier Ollama Cloud cascade
| Model | Role | Approx. latency |
|---|---|---|
| gpt-oss:20b-cloud | Primary: clean 6-key JSON, deterministic, lowest latency | ~5 s |
| gemini-3-flash-preview:cloud | Fallback #1: usually clean; occasional truncation | ~9 s |
| gpt-oss:120b-cloud | Fallback #2: big reasoning, content + thinking both populated | ~9 s |
| qwen3-coder:480b-cloud | Fallback #3: very high quality, slow | ~80 s |
Each model is consulted only if the previous one raised, returned empty content, or produced unparseable / heavily-incomplete JSON. The lineup is overridable via OLLAMA_MODEL and OLLAMA_FALLBACK_MODELS.
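One way the environment override could be resolved into an ordered model list is sketched below. The comma-separated format for OLLAMA_FALLBACK_MODELS and the helper name resolve_cascade are assumptions; the project may parse these variables differently.

```python
import os

DEFAULT_PRIMARY = "gpt-oss:20b-cloud"
DEFAULT_FALLBACKS = [
    "gemini-3-flash-preview:cloud",
    "gpt-oss:120b-cloud",
    "qwen3-coder:480b-cloud",
]

def resolve_cascade(env=None):
    """Return the ordered list of models to try, primary first."""
    env = os.environ if env is None else env
    primary = env.get("OLLAMA_MODEL", "").strip() or DEFAULT_PRIMARY
    # ASSUMPTION: fallbacks are given as a comma-separated string.
    raw = env.get("OLLAMA_FALLBACK_MODELS", "")
    fallbacks = [m.strip() for m in raw.split(",") if m.strip()] or DEFAULT_FALLBACKS
    # Drop duplicates while preserving order, so no model is called twice.
    ordered = []
    for model in [primary, *fallbacks]:
        if model not in ordered:
            ordered.append(model)
    return ordered
```

With neither variable set, the function returns the four-model lineup from the table above.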
Strict JSON output contract
The model is constrained to emit exactly six keys, each normalized post-hoc by safety_analyst.py:
| Key | Domain / Format |
|---|---|
| risk_level | One of Low, Medium, High |
| severity | One of Minor, Potentially Significant, Serious, Major, Near Miss |
| category_type | One of Incident, Near Miss, Other |
| root_cause | 2–4 sentence paragraph grounded in the retrieved cases |
| suggested_actions | 3–5 numbered actions, each "N. <sentence>. Owner: <role>. Timing: <Immediate / <30 days / 30-90 days / >90 days>. Verification: <text>." |
| best_practices | 2–3 short lessons drawn from the corpus |
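A contract like this is straightforward to check after the model responds. The sketch below is a hypothetical validator, not the normalization code in safety_analyst.py: it only reports which of the six keys are absent, empty, or outside their allowed domain.

```python
import json

ALLOWED_LABELS = {
    "risk_level": {"Low", "Medium", "High"},
    "severity": {"Minor", "Potentially Significant", "Serious", "Major", "Near Miss"},
    "category_type": {"Incident", "Near Miss", "Other"},
}
TEXT_KEYS = ("root_cause", "suggested_actions", "best_practices")
ALL_KEYS = list(ALLOWED_LABELS) + list(TEXT_KEYS)

def missing_sections(raw):
    """Return the contract keys that are missing or invalid in a raw JSON string."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return list(ALL_KEYS)  # unparseable output fails every key
    if not isinstance(payload, dict):
        return list(ALL_KEYS)
    missing = []
    for key, allowed in ALLOWED_LABELS.items():
        value = payload.get(key)
        if not (isinstance(value, str) and value in allowed):
            missing.append(key)
    for key in TEXT_KEYS:
        if not str(payload.get(key) or "").strip():
            missing.append(key)
    return missing
```

The length of the returned list is the kind of signal a cascade can use to decide between merging a few fields from the corpus fallback and moving on to the next model.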
gpt-oss:120b often returns message.content == "" when format="json" is forced and instead places the JSON object inside message.thinking. The engine reads both fields and picks the one that actually contains a JSON object, instead of failing and falling through.
DSPy + GEPA Prompt Optimization
The system instruction shipped in dspy_gepa_best_config.json was produced by dspy_gepa_benchmark.py — a self-contained driver that tunes the prompt against the actual analyst pipeline (TF-IDF + Ollama). The flow:
- Build a stratified eval set from events_clean.csv (5 input styles × 3 categories): keywords, title, snippet, mini-report, full report.
- Sweep four hand-authored prompt styles (balanced, strict_format, evidence_grounded, operational) against the live analyzer pipeline and pick the best.
- Wrap a dspy.ChainOfThought module around the strict-JSON signature.
- Run dspy.GEPA with a high-temperature reflection LM and a structured metric that scores label exactness, action structure, root-cause grounding, length, and best-practices quality.
- Persist the winning system instruction. The next python app.py picks it up automatically.
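The first step of this flow, building a stratified eval set, might look roughly like the sketch below. It assumes the three categories are the category_type values and that each case is assigned one input style in rotation; the function name and the round-robin scheme are hypothetical, not the logic in dspy_gepa_benchmark.py.

```python
import random
from collections import defaultdict

INPUT_STYLES = ["keywords", "title", "snippet", "mini-report", "full report"]

def stratified_cases(rows, num_cases, seed=0):
    """Pick up to num_cases rows, rotating through categories and input styles.

    rows: list of dicts, each with a 'category_type' key.
    """
    rng = random.Random(seed)  # fixed seed keeps the eval set reproducible
    by_category = defaultdict(list)
    for row in rows:
        by_category[row["category_type"]].append(row)
    for bucket in by_category.values():
        rng.shuffle(bucket)

    categories = sorted(by_category)
    cases, turn = [], 0
    while len(cases) < num_cases and any(by_category[c] for c in categories):
        category = categories[turn % len(categories)]
        if by_category[category]:
            style = INPUT_STYLES[len(cases) % len(INPUT_STYLES)]
            cases.append({"row": by_category[category].pop(), "input_style": style})
        turn += 1
    return cases
```

Because 3 categories and 5 styles share no common factor, 15 consecutive picks from well-stocked categories cover every category and style combination exactly once.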
Composite metric (used both as the GEPA objective and for native scoring)
score = 0.22·risk + 0.20·severity + 0.16·category + 0.16·root_cause + 0.18·actions + 0.08·best_practices
- Label scores are exact-match with ordinal partial credit (severity tiers earn credit for adjacent buckets).
- Root cause is jointly scored on length [80–600 chars] and token grounding against the gold root_causes + lessons columns.
- Action structure rewards numbered formatting plus the presence of Owner / Timing / Verification markers.
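To make "exact-match with ordinal partial credit" concrete, here is a small sketch. The tier values reuse the severity encoding from the Priority Scoring section; the 0.5 credit for a near miss on the scale and the function name are assumptions, since the source does not state how much credit an adjacent bucket earns.

```python
SEVERITY_TIER = {
    "Minor": 1,
    "Near Miss": 1,
    "Potentially Significant": 2,
    "Serious": 3,
    "Major": 3,
}

def ordinal_label_score(pred, gold, tiers, adjacent_credit=0.5):
    """1.0 for an exact match, partial credit within one tier, 0.0 otherwise."""
    if pred == gold:
        return 1.0
    if pred not in tiers or gold not in tiers:
        return 0.0  # unknown labels earn nothing
    # ASSUMPTION: labels in the same or a neighbouring tier earn adjacent_credit.
    gap = abs(tiers[pred] - tiers[gold])
    return adjacent_credit if gap <= 1 else 0.0
```

Under these assumptions, predicting "Potentially Significant" for a gold label of "Serious" earns 0.5, while predicting "Minor" for "Major" earns 0.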
The "winner" — operational style
After the sweep + GEPA reflection, the operational prompt style won. dspy_gepa_best_config.json contains the resulting instruction with explicit decision policy, output contract, formatting rules, and anti-pattern guards (e.g., no unicode bullets, no marketing filler, root cause must identify the underlying factor not restate the incident).
Re-tuning
# Quickest sanity check (~5 min on free-tier Ollama Cloud)
python dspy_gepa_benchmark.py --num-cases 24 --auto light
# Better quality (~20-30 min)
python dspy_gepa_benchmark.py --num-cases 36 --auto medium
# Just dump the eval cases without running GEPA
python dspy_gepa_benchmark.py --prepare-only
Dashboard Features
The Dash application (app.py, ~860 lines) delivers a production-quality executive dashboard with three main tabs:
Tab 1 — Performance Dashboard
- Animated KPI tiles (total cases, incident volume, near-miss conversion, high-risk %, severe cases) with client-side JavaScript counter animations.
- Overview sub-tab: bar charts by category type and risk level; line chart over time; stacked year breakdown; top primary classifications; filterable data table with CSV export.
- Clusters sub-tab: case-count ranking, per-cluster drilldown with donut chart, AI-extracted top terms & example incidents, ESG performance metrics, root-cause & corrective-action themes.
- Early Warning Dashboard: ranked EWI bar chart, counts heatmap (incidents / near-misses / high-priority), rates heatmap (near-miss rate, HP within NM, HP within Inc).
Tab 2 — Event Intelligence
- Sortable, filterable data portal with native search and CSV export for all historical events.
Tab 3 — AI Safety Analyst
- Free-text input for describing a hypothetical hazard scenario.
- Cascading LLM analysis with typewriter animation and source-event transparency.
- Expandable top-10 similar events table with individual CSV export.
- Response is annotated with the model used and the average TF-IDF similarity of the retrieved cases.
Global Filters
- Year range slider (2019–2024)
- Cluster dropdown (multi-select)
- High-risk-only toggle
Selected Code Highlights
TF-IDF retrieval (replaces Vertex Vector Search)
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# EVENTS_DF and _row_to_doc() are defined elsewhere in the module.
def _build_tfidf(df):
    # One text document per event row.
    corpus = [_row_to_doc(r) for _, r in df.iterrows()]
    vec = TfidfVectorizer(
        stop_words="english",
        ngram_range=(1, 2),
        max_df=0.95,
        min_df=1,
        sublinear_tf=True,
        max_features=20000,
    )
    return vec, vec.fit_transform(corpus)

# Built once at import time.
VECTORIZER, EVENT_MATRIX = _build_tfidf(EVENTS_DF)

def retrieve_similar_events(query, k=10):
    qvec = VECTORIZER.transform([query])
    sims = cosine_similarity(qvec, EVENT_MATRIX).ravel()
    idx = np.argsort(sims)[::-1][:k]  # indices of the k most similar events
    out = EVENTS_DF.iloc[idx].copy()
    out["similarity"] = np.round(sims[idx], 4)
    return out.reset_index(drop=True)

Cascading Ollama Cloud call with thinking-channel fallback
from ollama import Client

# OLLAMA_API_BASE, OLLAMA_API_KEY, OLLAMA_MODEL and LLM_TIMEOUT_S are module-level settings.
def _ollama_chat(prompt, system, model=None):
    """Reasoning models (gpt-oss:120b) often emit empty `content` when
    format='json' is forced and put the JSON inside `thinking`. Read both
    and prefer the channel that actually contains a JSON object."""
    client = Client(host=OLLAMA_API_BASE,
                    headers={"Authorization": f"Bearer {OLLAMA_API_KEY}"},
                    timeout=LLM_TIMEOUT_S)
    response = client.chat(
        model=model or OLLAMA_MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": prompt}],
        format="json",
        options={"temperature": 0.2, "num_predict": 2500},
    )
    content = (response.message.content or "").strip()
    thinking = (response.message.thinking or "").strip()
    # Empty content: fall back to the thinking channel.
    if not content and thinking:
        return thinking
    # Content without a JSON object, but thinking has one: prefer thinking.
    if content and "{" not in content and "{" in thinking:
        return thinking
    return content

Cascading retry + corpus gap-fill
# Excerpt: prompt, top_k, log and the helper functions are defined in the enclosing module.
candidate_models = [OLLAMA_MODEL, *OLLAMA_FALLBACK_MODELS]
payload, used_model, merged_fields = None, None, []
for model in candidate_models:
    try:
        normalized, _ = _try_llm(prompt, model)
        missing = [f for f in _REQUIRED_LLM_FIELDS if not normalized.get(f)]
        if len(missing) > MAX_MISSING_FIELDS_BEFORE_FULL_FALLBACK:
            raise RuntimeError(f"missing too many sections: {missing}")
        if missing:  # 1-2 missing -> merge from the corpus fallback
            fb = fallback_response(top_k)
            for fld in missing:
                normalized[fld] = fb[fld]
            merged_fields = missing
        payload, used_model = normalized, model
        break
    except Exception as exc:
        log.warning("%s failed: %s", model, exc)
        continue
if payload is None:  # every model died
    payload = fallback_response(top_k)  # deterministic corpus-only

DSPy signature for GEPA optimization
import dspy

class AnalyzeSafetyEvent(dspy.Signature):
    """Methanex EPSSC safety analyst: predict labels and write a structured
    report grounded in retrieved historical cases. Strict-JSON output."""
    # Inputs
    incident_input: str = dspy.InputField(desc="User report (keywords / snippet / full).")
    historical_context: str = dspy.InputField(desc="TF-IDF retrieved similar cases.")
    style_guide: str = dspy.InputField(desc="Stylistic guidance for the response.")
    # Outputs: the six keys of the JSON contract
    risk_level: str = dspy.OutputField(desc='"Low" | "Medium" | "High".')
    severity: str = dspy.OutputField(desc='"Minor" | "Potentially Significant" | "Serious" | "Major" | "Near Miss".')
    category_type: str = dspy.OutputField(desc='"Incident" | "Near Miss" | "Other".')
    root_cause: str = dspy.OutputField(desc="2-4 sentences grounded in historical context.")
    suggested_actions: str = dspy.OutputField(desc="3-5 numbered actions w/ Owner / Timing / Verification.")
    best_practices: str = dspy.OutputField(desc="2-3 short bullet lessons from the corpus.")

Composite GEPA metric with structured feedback
# The _label_score, _length_score, _grounding_score and _action_score helpers are defined elsewhere in the benchmark driver.
def metric_from_payload(case, payload):
    # Label accuracy
    risk = _label_score(payload["risk_level"], case.gold_risk, VALID_RISK)
    sev = _label_score(payload["severity"], case.gold_severity, VALID_SEVERITY)
    cat = _label_score(payload["category_type"], case.gold_category, VALID_CATEGORY)
    # Root cause: half length, half grounding
    rc = (0.5 * _length_score(payload["root_cause"], 80, 600)
          + 0.5 * _grounding_score(payload["root_cause"],
                                   case.gold_root_cause, case.gold_lessons))
    act = _action_score(payload["suggested_actions"])
    bp = (0.6 * _length_score(payload["best_practices"], 50, 480)
          + 0.4 * _grounding_score(payload["best_practices"], "", case.gold_lessons))
    # Weights sum to 1.0
    final = 0.22*risk + 0.20*sev + 0.16*cat + 0.16*rc + 0.18*act + 0.08*bp
    feedback = (f"risk={risk:.2f}, sev={sev:.2f}, cat={cat:.2f}, "
                f"rc_ground={rc:.2f}, action_struct={act:.2f}, bp={bp:.2f}. "
                "Match risk/severity/category EXACTLY. Each action MUST contain "
                "Owner, Timing, AND Verification. Ground root cause in retrieved cases.")
    return final, feedback

Cloud Run Deployment
A slim Python 3.11 image binds gunicorn to $PORT with 1 worker × 8 threads and a 120s timeout. The Ollama key is wired in via Secret Manager so no credentials live in the image:
PROJECT_ID=your-gcp-project
REGION=us-west1
# 1. Build & push to Artifact Registry
gcloud builds submit \
  --tag $REGION-docker.pkg.dev/$PROJECT_ID/methanex/epssc-dashboard:latest
# 2. Stash the Ollama key in Secret Manager
#    (printf avoids the trailing newline that a here-string would add to the secret)
printf '%s' "<paste-your-key>" | gcloud secrets create OLLAMA_API_KEY --data-file=-
# 3. Deploy with the secret mounted as an env var
gcloud run deploy methanex-epssc \
  --image $REGION-docker.pkg.dev/$PROJECT_ID/methanex/epssc-dashboard:latest \
  --region $REGION \
  --allow-unauthenticated \
  --update-secrets OLLAMA_API_KEY=OLLAMA_API_KEY:latest \
  --memory 1Gi --cpu 1 --concurrency 40 --timeout 180

.gcloudignore excludes the entire legacy_rag_engine/ folder, local .env files, caches, virtualenvs, notebooks, and the dev-only dspy_gepa_eval_cases.json so production uploads stay lean.
Key Findings & Recommendations
- Focus on dominant clusters: The Pareto module shows which operational clusters account for the highest volume of reports and highest combined Risk × Severity priority.
- Target high-severity ratios: Clusters with a high Incident-to-Near-Miss conversion rate (e.g., AI Monitoring & Decision-Support Errors in 2024) are vulnerabilities where current defenses are frequently failing — those should be the top investment areas.
- Proactive monitoring: The timeline module surfaces emerging risks (e.g., a 2024 spike in IT/AI-related exposures) before they become systemic hazards.
- Grounded triage at the desk: The Generative AI Safety Analyst gives any investigator a structured, corpus-grounded triage of a free-text incident in seconds — including the top-10 most similar historical events for cross-reference.
- Resilience built-in: The cascading LLM + deterministic corpus fallback means the dashboard is never silently broken — every response is annotated with its source.
How to Run Locally
Repository Structure
methanex-safety-intelligence/
│
├── app.py # Dash dashboard (UI, KPIs, AI Analyst tab)
├── safety_analyst.py # NEW v2 — TF-IDF + Ollama Cloud RAG engine
├── dspy_gepa_benchmark.py # NEW v2 — DSPy + GEPA prompt-optimization driver
├── dspy_gepa_best_config.json # NEW v2 — persisted GEPA-tuned system instruction
├── dspy_gepa_eval_cases.json # NEW v2 — stratified eval cases (reproducibility)
│
├── Dockerfile # NEW v2 — Cloud Run production image
├── .gcloudignore # NEW v2 — excludes legacy / dev files from deploy
├── .env.example # Updated — Ollama Cloud + optional legacy vars
├── requirements.txt # Updated — Ollama, DSPy, GEPA, gunicorn
│
├── data/
│ ├── events_clean.csv # Cleaned 2019-2024 event corpus
│ ├── actions_clean.csv # Cleaned recommended actions
│ ├── case_cluster_map.csv # NLP-generated cluster mappings
│ ├── case_priority_scores.csv
│ ├── cluster_profile_sorted.csv
│ ├── cluster_summary_with_terms_examples.csv
│ └── near_miss_early_warning_dashboard.csv
│
├── assets/
│ ├── style.css # Methanex corporate CSS
│ ├── logo.svg
│ └── favicon.ico
│
├── legacy_rag_engine/ # ARCHIVED — Vertex AI / Gemini v1 stack
│ ├── README.md # Why this folder exists + how to re-enable
│ ├── rag_engine.py # Original Vertex Vector Search RAG
│ ├── setup_cloud.py # One-time GCP index/endpoint bootstrap
│ ├── data_processing.py # Historical KPI helper (no longer imported)
│ └── gcp-key.json # Template service-account JSON
│
├── LICENSE
└── README.md
Three commands — zero GCP required
git clone https://github.com/Regan-Yin/Hackathon_Project_Incident_mining_Methanex.git
cd Hackathon_Project_Incident_mining_Methanex
python3.11 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# Get a free Ollama Cloud key at https://ollama.com/settings/keys
cp .env.example .env
# Paste your key into OLLAMA_API_KEY=...
python app.py
# → http://127.0.0.1:8050/

Open the Generative AI Safety Analyst tab, paste any short snippet (e.g., "Operator slipped near unsealed valve during night shift"), and within ~5–10 s you'll see a predicted Risk / Severity / Category, a grounded Root Cause paragraph, 3–5 structured Suggested Actions, 2–3 corpus-derived Best Practices, and the Top 10 Similar Historical Events table.