Safety Incident & Near-Miss Pattern Mining — Methanex Hackathon

NLP/Clustering/Google Cloud (Vertex AI)/RAG/Gemini/Dash/Plotly/Full-Stack Dashboard

Author: Regan Yin  |  Team: Bubble Team — Reg Lei, Jeffrey Sun, Cayden Li, Regan Yin, Jiale Guan
Event: UBC MBAn Hackathon 2026  |  Client: Methanex Corporation
A full-stack analytics dashboard and RAG-powered AI Safety Analyst that transforms 5 years of unstructured incident records into actionable prevention intelligence.


Executive Summary

Dataset: 203 incidents · 1,659 actions
Timespan: 2019 – 2024
Locations: 10+ sites
NLP Clusters: 7 risk scenarios
AI Model: Gemini 2.5 Flash

Methanex collects vast amounts of safety records, but they are reviewed case-by-case, making recurring patterns hard to spot. We built an end-to-end pipeline that clusters events via NLP, quantifies risk with an Early Warning Index, and deploys a Dash-based executive dashboard alongside a Generative-AI advisor that retrieves similar historical incidents in real time.

Tech Stack

Python · Dash · Plotly · Pandas / NumPy · Google Cloud Platform · Vertex AI · Gemini 2.5 Flash · Vector Search (Matching Engine) · Text Embeddings (text-embedding-004) · Cloud Storage · LangChain · NLP / Clustering · RAG · CSS (Custom Corporate Theme)

Problem Statement

Safety records contain rich lessons, but reviewing them case-by-case makes recurring scenarios hard to spot and slows frontline guidance. Our challenge was to bridge the gap between raw, localized safety narratives and systemic business intelligence:

  1. Identify patterns and clusters of similar events (e.g., AI system failures, LOTO gaps, HR privacy exposures).
  2. Understand the factors driving higher severity (actual incidents) versus high-potential warnings (near-misses).
  3. Provide data-driven recommendations on where to focus prevention efforts and training.

Step-by-Step Methodology

Step 1 — Data Ingestion & Preprocessing

Raw, messy text narratives and structured fields were cleaned and standardized. Key temporal and categorical variables (year, category_type, risk_level, severity) were extracted to build a solid analytical foundation.
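A minimal sketch of this cleaning pass. Column names (`title`, `what_happened`, `event_date`) are illustrative assumptions, not the actual schema of the raw export:

```python
import pandas as pd

def preprocess_events(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pass: standardize text and extract key variables."""
    df = raw.copy()
    # Normalize free-text narratives: trim ends, collapse internal whitespace runs
    for col in ["title", "what_happened"]:
        if col in df.columns:
            df[col] = df[col].fillna("").str.strip().str.replace(r"\s+", " ", regex=True)
    # Standardize categorical labels for consistent grouping
    for col in ["category_type", "risk_level", "severity"]:
        if col in df.columns:
            df[col] = df[col].str.strip().str.lower()
    # Derive the year from the event date for trend analysis
    if "event_date" in df.columns:
        df["year"] = pd.to_datetime(df["event_date"], errors="coerce").dt.year
    return df
```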

Step 2 — NLP Pattern Mining & Clustering

Using Vertex AI text-embedding-004, we converted incident narratives (title + setting + what happened + root causes) into high-dimensional vectors, then applied clustering to group events into 7 distinct, actionable "Risk Scenario" clusters. The mapping was saved as case_cluster_map.csv.
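A sketch of the clustering step, assuming narratives are already embedded (the real vectors come from Vertex AI text-embedding-004; random stand-ins are used here). KMeans with k=7 is one plausible choice — the write-up does not name the exact algorithm:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def cluster_narratives(embeddings: np.ndarray, event_ids: list,
                       n_clusters: int = 7) -> pd.DataFrame:
    """Group incident embeddings into risk-scenario clusters; return the case→cluster map."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    labels = km.fit_predict(embeddings)
    return pd.DataFrame({"event_id": event_ids, "cluster": labels})

# Stand-in vectors: real ones are 768-dim text-embedding-004 outputs
rng = np.random.default_rng(0)
vecs = rng.normal(size=(203, 768))
cluster_map = cluster_narratives(vecs, [f"EV-{i:03d}" for i in range(203)])
# cluster_map.to_csv("case_cluster_map.csv", index=False)
```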

Step 3 — Severity Driver Analysis

We mapped each cluster against the ratio of Incidents (realized harm) to Near-Misses (free lessons). This allowed us to statistically identify which operational areas are high-potential risks that bypass current safety barriers.

A composite Early Warning Index was formulated:

EWI = (Near-miss rate) × (High-priority share within near-misses) × log(1 + n_cases)

This prioritizes clusters where near-misses are frequent, serious, and occur at meaningful scale.
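The formula above can be sketched directly; the cluster numbers below are toy values for comparison, not figures from the dataset:

```python
import numpy as np

def early_warning_index(near_miss_rate: float, hp_within_nm: float,
                        n_cases: int) -> float:
    """EWI = near-miss rate × high-priority share within near-misses × log(1 + n_cases).
    Rates are fractions in [0, 1]."""
    return near_miss_rate * hp_within_nm * np.log1p(n_cases)

# Toy comparison of two hypothetical clusters
ewi_a = early_warning_index(0.80, 0.50, 40)  # frequent, serious near-misses at scale
ewi_b = early_warning_index(0.30, 0.10, 5)   # few, mostly low-priority near-misses
```

Cluster A scores higher because all three factors reinforce each other multiplicatively — a cluster weak on any one factor is demoted.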

Step 4 — Full-Stack MVP Deployment

We elevated standalone local Python scripts into a live MVP using Dash + Plotly with a custom Methanex corporate CSS theme and deployed it to the web so stakeholders could dynamically explore trends and drill into the data themselves.

Technical Architecture

A modular design supports both the dashboard and the AI advisor while maintaining strong governance.

Data Sources

  • 500+ pages of incident records in PDF
  • Cleaned CSVs: events_clean.csv, actions_clean.csv
  • NLP cluster map: case_cluster_map.csv

Processing

  • Text extraction & structuring
  • Vertex AI embeddings (text-embedding-004)
  • Clustering & metrics/KPIs
  • Trend & driver analysis

Analytics Layer

  • NLP pattern mining
  • Early Warning Index
  • Priority scoring (Risk + Severity)
  • Evidence-based GenAI workflow

Experiences

  • Executive dashboard (KPI tiles, charts, heatmaps)
  • Interactive cluster explorer
  • GenAI Safety Analyst (RAG)
  • CSV export & data portal

Governance & Controls

Role-based access · PII redaction · Human curation of “gold” guidance · Audit trails · Model monitoring (drift + quality)

NLP Clusters & Early Warning

Using narrative text, we grouped the 203 events into 7 clusters, each with searchable keywords:

# | Cluster | Scenario | Primary Prevention Lever
0 | AI Monitoring / Decision-Support Errors | AI alarms/recommendations mislead operations | Human-in-the-loop verification + drift monitoring
1 | Stored Energy / LOTO Gaps | Hydraulic/pneumatic stored energy during maintenance | Multi-energy LOTO checklist + “try-actuate” verification
2 | Office Electrical / Ergonomics | WFH/office safety: power bars, cords, trips, strains | Basic office safety standards + housekeeping checklist
3 | Line Work / Piping Containment | Pipes/valves not fully depressurized → release | Standard line-break procedure + pressure confirmation
4 | Field Safety: Height / Confined Space | Elevated work or confined space, often contractor tasks | Permit-to-work discipline + contractor supervision
5 | Cyber-Physical Control Disruption | Unauthorized access affects controls → process deviation | Access hardening + segmentation + two-person rule
6 | HR / Privacy Exposure | Sensitive HR info exposed (overheard calls, visible screens) | Privacy-by-default + secure sharing rules

Priority & Early Warning Scoring

Each event is scored by combining risk level and severity:

Risk Level Encoding

  • High = 2
  • Medium = 1
  • Low = 0

Severity Encoding

  • Serious / Major = 3
  • Potentially Significant = 2
  • Minor = 1

Priority = Risk + Severity, giving a score from 1 to 5 (Low: 1–2, Medium: 3, High: ≥ 4). The Early Warning Index then aggregates across clusters to rank the most urgent areas of focus.

AI Safety Analyst (RAG + Gemini)

The AI tab lets frontline users describe a situation in plain language and receive instant, evidence-based analysis. The pipeline works as follows:

  1. Embed user query using Vertex AI text-embedding-004.
  2. Vector Search via GCP Matching Engine retrieves the top 10 most similar historical incidents from the indexed corpus.
  3. Gemini 2.5 Flash generates a structured analysis (predicted risk/severity, root cause, suggested actions) using the retrieved context.
  4. Results are displayed with a typewriter animation and an expandable table of the 10 source events for transparency.
import pandas as pd
from google.cloud import aiplatform
from vertexai.language_models import TextEmbeddingModel
from langchain_google_vertexai import ChatVertexAI

# ENDPOINT_ID and DEPLOYED_INDEX_ID are loaded from the .env configuration
def analyze_new_event(hypothesis_text, events_df):
    """Finds the Top 10 similar events via GCP Vector Search and generates analysis."""
    embeddings_model = TextEmbeddingModel.from_pretrained("text-embedding-004")
    query_vector = embeddings_model.get_embeddings([hypothesis_text])[0].values

    endpoint = aiplatform.MatchingEngineIndexEndpoint(index_endpoint_name=ENDPOINT_ID)
    response = endpoint.find_neighbors(
        deployed_index_id=DEPLOYED_INDEX_ID,
        queries=[query_vector],
        num_neighbors=10
    )

    top_10_ids = [neighbor.id for neighbor in response[0]]
    top_10_events = events_df[events_df['event_id'].isin(top_10_ids)]

    context = "TOP 10 HISTORICAL METHANEX EVENTS:\n\n"
    for idx, (_, row) in enumerate(top_10_events.iterrows(), 1):
        context += f"Event {idx}:\n"
        for col in top_10_events.columns:
            value = row[col]
            if pd.notna(value):
                context += f"  {col}: {value}\n"
        context += "\n"

    prompt = f"""You are an expert Process Safety Engineer for Methanex EPSSC.
    Analyze the following hypothetical incident/near-miss report based strictly
    on the historical events provided.

    Hypothetical Event Input:
    {hypothesis_text}

    {context}

    Provide your analysis STRICTLY in the following format:
    ### Predicted Risk Level & Severity
    ### Potential Root Cause
    ### Suggested Actions
    """

    llm = ChatVertexAI(model_name="gemini-2.5-flash", temperature=0.2)
    ai_response = llm.invoke(prompt)
    return ai_response.content, top_10_events

Dashboard Features

The Dash application (app.py, ~860 lines) delivers a production-quality executive dashboard with three main tabs:

Tab 1 — Performance Dashboard

  • Animated KPI tiles (total cases, incident volume, near-miss conversion, high-risk %, severe cases) with client-side JavaScript counter animations.
  • Overview sub-tab: bar charts by category type, risk level; line chart over time; stacked year breakdown; top primary classifications; filterable data table with CSV export.
  • Clusters sub-tab: case count ranking, per-cluster drilldown with donut chart, AI-extracted top terms & example incidents, ESG performance metrics, root-cause & corrective-action themes.
  • Early Warning Dashboard: ranked EWI bar chart, counts heatmap (incidents/near-misses/high-priority), rates heatmap (near-miss rate, HP within NM, HP within Inc).

Tab 2 — Event Intelligence

  • Sortable, filterable data portal with native search and CSV export for all historical events.

Tab 3 — AI Safety Analyst

  • Free-text input for describing a hypothetical hazard scenario.
  • RAG-powered analysis with typewriter animation and source-event transparency.
  • Expandable top-10 similar events table with individual CSV export.

Global Filters

  • Year range slider (2019–2024)
  • Cluster dropdown (multi-select)
  • High-risk-only toggle
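The three global filters above boil down to a single filtering function; in app.py this would sit inside a Dash @callback wired to the slider, dropdown, and toggle. Column names (`year`, `cluster`, `risk_level`) are assumed to match events_clean.csv:

```python
import pandas as pd

def apply_global_filters(events, year_range, clusters=None, high_risk_only=False):
    """Apply the dashboard's three global filters and return the filtered frame."""
    lo, hi = year_range
    out = events[(events["year"] >= lo) & (events["year"] <= hi)]
    if clusters:                      # empty selection = no cluster filter
        out = out[out["cluster"].isin(clusters)]
    if high_risk_only:
        out = out[out["risk_level"].str.lower() == "high"]
    return out
```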

Selected Code Highlights

Early Warning Index Computation

def calc_priority(r, s):
    """Priority = Risk level score + Severity score"""
    r_score = 2 if 'high' in str(r).lower() else (1 if 'medium' in str(r).lower() else 0)
    s_score = 3 if ('major' in str(s).lower() or 'serious' in str(s).lower()) \
              else (2 if 'significant' in str(s).lower() else 1)
    return r_score + s_score

# Per-event priority scoring and near-miss / high-priority flags
ewi_df['priority_score'] = ewi_df.apply(
    lambda x: calc_priority(x.get(risk_col, ''), x.get(sev_col, '')), axis=1
)
ewi_df['is_hp'] = ewi_df['priority_score'] >= 4
ewi_df['is_nm'] = ewi_df[cat_col].str.lower().isin(['near miss', 'nearmiss', 'near_miss'])

# Per-cluster aggregation into the EWI (agg rates are percentages, hence /100)
agg['ewi'] = (agg['near_miss_rate']/100.0) \
            * (agg['hp_within_nm']/100.0) \
            * np.log1p(agg['n_cases'])

GCP Vector Search Index Setup

from google.cloud import aiplatform
from vertexai.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("text-embedding-004")
texts = events_df['rag_content'].tolist()
embeddings = []

# Batch embedding to stay under per-request limits
for i in range(0, len(texts), 25):
    batch_results = model.get_embeddings(texts[i:i+25])
    embeddings.extend([emb.values for emb in batch_results])

# Create Vertex AI Vector Search index
my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="methanex_safety_index",
    contents_delta_uri=f"gs://{BUCKET_NAME}/index_data",
    dimensions=768,
    approximate_neighbors_count=10,
    distance_measure_type="DOT_PRODUCT_DISTANCE"
)

# Deploy to public endpoint
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="methanex_safety_endpoint",
    public_endpoint_enabled=True
)
my_index_endpoint.deploy_index(index=my_index, deployed_index_id="methanex_deployed_index")

Client-Side KPI Counter Animation (JavaScript)

function animateValue(id, start, end, duration, isPercent) {
    let obj = document.getElementById(id);
    let startTime = null;
    const step = (timestamp) => {
        if (!startTime) startTime = timestamp;
        const progress = Math.min((timestamp - startTime) / duration, 1);
        const easeProgress = 1 - Math.pow(1 - progress, 3);  // cubic ease-out
        let current = easeProgress * (end - start) + start;
        if (isPercent) obj.innerHTML = current.toFixed(1) + "%";
        else obj.innerHTML = Math.floor(current);
        if (progress < 1) window.requestAnimationFrame(step);
    };
    window.requestAnimationFrame(step);
}

Key Findings & Recommendations

  • Focus on dominant clusters: The Pareto analysis module shows which operational clusters account for the highest volume of reports.
  • Target high-severity ratios: Prioritize prevention on clusters with a high ratio of Incidents to Near-Misses, where current defenses are frequently failing.
  • Proactive monitoring: Timeline trend analysis lets Methanex spot emerging risks (e.g., a spike in IT/AI-related exposures in 2024) before they become systemic.
  • Faster learning loop: Incident → insight → prevention focus cycle is accelerated by the AI advisor's real-time retrieval and structured recommendations.
  • Consistent access: All lessons learned are searchable, reducing time-to-guidance for frontline users.

How to Run Locally

Repository Structure

methanex-safety-intelligence/
├── data/
│   ├── events_clean.csv
│   ├── actions_clean.csv
│   └── case_cluster_map.csv
├── assets/
│   └── style.css
├── app.py                   # Main Dash application (~860 lines)
├── rag_engine.py            # Vertex AI / Gemini RAG module
├── setup_cloud.py           # One-time GCP Vector Search setup
├── data_processing.py       # Data loading & visualization helpers
├── requirements.txt
├── .env.example
└── README.md

Prerequisites

  1. Create a GCP project and enable the Vertex AI API.
  2. Configure Vertex AI Vector Search (index + endpoint).
  3. Create a service account, download its JSON key as gcp-key.json.
  4. Copy .env.example to .env and fill in your GCP project ID, location, endpoint IDs, and bucket name.

Run

pip install -r requirements.txt

# First time only — set up the vector search index (~60 min)
python setup_cloud.py

# Launch the dashboard
python app.py
# Navigate to http://127.0.0.1:8050/