This post captures the practical architecture (2015–2018) that supported the upstream evaluation & stochastic modeling framework outlined in the related post: Upstream Asset Evaluation Framework. It maps legacy design choices to modern terminology and highlights constraints that shaped modeling workflows.
1. Core Principles#
- On‑prem / hybrid HPC + Hadoop (YARN) cluster for heavy simulation; limited early cloud (select AWS EC2/EMR/S3; occasional Azure VM/Blob).
- No unified “lakehouse” yet: layered zones → Raw (HDFS/S3) → Curated (Hive/Parquet/ORC) → Marts (Hive/Impala/Presto).
- Limited containers/Kubernetes; batch schedulers dominated (Oozie, early Airflow pilot, Control‑M, Cron).
- Governance largely manual: Hive Metastore + ad hoc catalog (Excel / SharePoint / SQL).
2. Data Ingestion#
| Source Type | Examples | Mechanism | Notes |
|---|---|---|---|
| Geoscience | LAS, SEG-Y | Batch file drop + ETL parse | Large binaries + metadata extraction |
| Well / Ops | WITSML feeds | Batch pull / scheduled parse | Standardization step into Hive |
| ERP / Finance | CSV / RDBMS exports | Sqoop (RDBMS→HDFS), SSIS, Python/.NET ETL | Controlled nightly cadence |
| SCADA / Events | Downtime logs | Kafka 0.8/0.9 (where deployed) or Flume/Logstash | Early streaming footprint |
| Market / Pricing | Excel price decks | Staged in SQL, then approved into config tables | Manual approval workflow |
Workflow orchestration: Oozie XML workflows early; selective Airflow DAGs (late 2017–2018) for transparency and dependency visualization.
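A minimal Airflow 1.x-style DAG sketch of that pattern; the DAG id, task scripts, and owner below are hypothetical stand-ins for the real pipeline:

```python
# Sketch of a late-2017 Airflow DAG replacing an Oozie XML workflow.
# DAG id, scripts, and owner are illustrative assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "econ_eng",                  # hypothetical team owner
    "retries": 1,
    "retry_delay": timedelta(minutes=15),
}

dag = DAG(
    "nightly_param_refresh",              # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="0 2 * * *",        # 02:00 nightly batch window
)

extract = BashOperator(
    task_id="extract_approved_params",
    bash_command="python extract_params.py",     # hypothetical script
    dag=dag,
)

forecast = BashOperator(
    task_id="spark_decline_expansion",
    bash_command="spark-submit decline_expand.py",  # hypothetical job
    dag=dag,
)

extract >> forecast  # dependency edge rendered in the Airflow UI
```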
3. Storage & Data Layout#
- Raw Zone: HDFS (gzip/original form) + limited S3 buckets (hybrid tests).
- Curated Zone: Hive tables (Parquet/ORC), partitioned (date, asset, well_id).
- Reservoir simulation outputs: Eclipse / CMG binary grids + summary CSV parsed into Hive fact tables.
- Master data (wells, facilities, contracts): SQL Server / Oracle system-of-record, mirrored to Hive dims.
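A sketch of the curated-zone write pattern in PySpark, assuming illustrative column names (`prod_date`, `asset`, `well_id`, `oil_rate`) and paths:

```python
# Sketch: landing a curated Hive-style table as partitioned Parquet.
# Paths and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("curate_production").getOrCreate()

raw = spark.read.csv("/raw/production/", header=True, inferSchema=True)

(raw.select("prod_date", "asset", "well_id", "oil_rate")
    .write
    .mode("overwrite")
    .partitionBy("prod_date", "asset", "well_id")  # matches the zone layout above
    .parquet("/curated/production/"))
```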
4. Processing & Modeling Stack#
| Layer | Tools / Tech | Purpose |
|---|---|---|
| Batch Parallel | Spark 1.6 → 2.x | Monte Carlo sample generation, decline curve expansion, large joins |
| Statistical / Valuation | Python 2.7/3.5 (NumPy, SciPy, Pandas) | Economic metrics, real options calculations |
| Advanced Stats | R (forecast, variance decomposition) | Specialized fitting & diagnostic tests |
| HPC | SLURM / PBS / Grid Engine | Reservoir realizations & large path batches |
| Options | Custom Python/R (binomial, Longstaff–Schwartz) | Option premium computations |
| Correlation Sampling | Python (Cholesky) | Reproducible correlated factor draws |
Random seeds & correlation matrices persisted to control tables for reproducibility.
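A minimal sketch of the Cholesky sampling step; the 3×3 correlation matrix and the seed are illustrative (in practice both were read from the control tables):

```python
# Cholesky-based correlated sampling with a persisted seed.
# Factor ordering and matrix values are illustrative.
import numpy as np

SEED = 20180115                      # persisted to a control table in practice
rng = np.random.RandomState(SEED)    # era-appropriate NumPy API

# Correlation across price, capex, and production factors (illustrative).
corr = np.array([
    [1.0,  0.3, -0.2],
    [0.3,  1.0,  0.1],
    [-0.2, 0.1,  1.0],
])

L = np.linalg.cholesky(corr)              # lower-triangular factor
z = rng.standard_normal((10000, 3))       # independent standard normal draws
correlated = z.dot(L.T)                   # rows now carry the target correlation

# Sanity check: empirical correlation should approximate `corr`.
print(np.corrcoef(correlated, rowvar=False).round(2))
```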
5. Fiscal & Economic Engine#
- Core cash waterfall: SQL stored procedures (SQL Server) or Python module for flexibility.
- Scenario parameter schemas: `price_deck`, `capex_curve`, `opex_class`, `tax_regime`, `option_trigger`.
- Run logging: `run_log` table referencing immutable parameter version rows.
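A sketch of that run-logging pattern; sqlite3 stands in for SQL Server here, and the `run_log` columns shown are assumptions based on the references above:

```python
# Each run row references immutable parameter version rows.
# Schema and version ids are illustrative.
import sqlite3
from datetime import datetime

conn = sqlite3.connect("econ_runs.db")   # stand-in for SQL Server
conn.execute("""
    CREATE TABLE IF NOT EXISTS run_log (
        run_id        INTEGER PRIMARY KEY AUTOINCREMENT,
        started_at    TEXT,
        price_deck_id INTEGER,   -- FK to price_deck_version
        cost_curve_id INTEGER,   -- FK to cost_curve_version
        status        TEXT
    )
""")
conn.execute(
    "INSERT INTO run_log (started_at, price_deck_id, cost_curve_id, status) "
    "VALUES (?, ?, ?, ?)",
    (datetime.utcnow().isoformat(), 42, 17, "RUNNING"),
)
conn.commit()
```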
6. Parameter & Assumption Management#
- Version tables: `reserves_version`, `price_deck_version`, `cost_curve_version`.
- Approval workflow: SharePoint / email sign-off toggling an `approved=Y` flag.
- Identity via composite keys (model_version + parameter_version_ids); minimal hashing at the time.
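A small sketch of the composite-key identity, plus the content hash that was rarely applied at the time; version ids are illustrative:

```python
# Composite run identity from model + parameter versions (illustrative ids).
import hashlib

model_version = "econ_v3.2"
parameter_versions = {"reserves": 12, "price_deck": 42, "cost_curve": 17}

# Deterministic ordering keeps the key reproducible across runs.
composite = model_version + "|" + "|".join(
    "%s=%d" % (k, parameter_versions[k]) for k in sorted(parameter_versions)
)
print(composite)  # econ_v3.2|cost_curve=17|price_deck=42|reserves=12

# Optional content hash (largely absent in the 2015-2018 stack).
print(hashlib.sha256(composite.encode("utf-8")).hexdigest()[:12])
```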
7. Workflow Orchestration (Typical DAG)#
1. Extract latest approved parameters (SQL).
2. Generate production forecasts (Spark decline expansion).
3. Launch Monte Carlo batches (Spark jobs or HPC submissions).
4. Aggregate distributions (Spark reduce / SQL summarize).
5. Apply real options overlay (Python step).
6. Persist summary percentiles (Hive + SQL Server).
7. Notify stakeholders (email distribution).
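Steps 4 and 6 can be illustrated with a PySpark sketch that collapses Monte Carlo NPV samples into P10/P50/P90 per asset; table and column names are illustrative:

```python
# Sketch of the distribution-aggregation step (illustrative table names).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("aggregate_npv").getOrCreate()

samples = spark.table("econ.npv_samples")   # one row per (asset, path)

percentiles = samples.groupBy("asset").agg(
    F.expr("percentile_approx(npv, 0.1)").alias("p10"),
    F.expr("percentile_approx(npv, 0.5)").alias("p50"),
    F.expr("percentile_approx(npv, 0.9)").alias("p90"),
)

percentiles.write.mode("overwrite").saveAsTable("econ.npv_percentiles")
```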
8. Reporting & Analytics#
- Executive BI: Spotfire / Tableau / early Power BI (NPV P10/P50/P90, tornado drivers).
- Ad hoc: Excel pivots via ODBC to curated SQL views.
- Storage optimization: percentiles persisted; optional compressed binary full-path arrays (NFS) for audit replay (see the sketch after this list).
- Delta comparisons: SQL view diffing latest run vs prior approved baseline.
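A sketch of the optional full-path archival, using compressed NumPy archives on an NFS share; the path, shapes, and stand-in data are illustrative:

```python
# Compressed full-path arrays keyed by run_id for audit replay.
import numpy as np

run_id = 20180115
price_paths = np.random.rand(2000, 360)   # 2k paths x 360 months (stand-in data)

archive = "/nfs/econ_archive/run_%d_paths.npz" % run_id
np.savez_compressed(archive, price_paths=price_paths)

# Audit replay later:
replayed = np.load(archive)["price_paths"]
assert replayed.shape == (2000, 360)
```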
9. Integration with Business Units#
| Domain | Inbound Flow | Processing | Output Target |
|---|---|---|---|
| Geology | Petrel / Eclipse exports | ETL parse | Hive reservoir summaries |
| Drilling | Daily reports (PDF → manual XLS) | Structured capture | SQL + Hive staging |
| Production Ops | Hourly SCADA snapshots | Kafka / batch ingest | Event log / downtime model |
| Marketing | Monthly price deck Excel | Controlled ETL & approval | price_deck_version |
| Finance | Nightly ERP CAPEX/OPEX | Sqoop import | Cost tables / allocation feeds |
| Risk | Correlation matrices | Import to risk_correlation module | Simulation inputs |
| Land / Infrastructure | Tariff & capacity schedules | Direct table loads | Constraint models |
10. Security & Access#
- Hadoop: Kerberos (where maturity allowed), otherwise OS ACLs.
- SQL Server roles: `FINANCE_READ`, `GEO_WRITE`, `ECON_ADMIN`.
- File share oversight: manual audits (Windows DFS / NFS export lists).
11. Governance & Data Quality#
- Python DQ scripts logging exceptions to a `dq_issue` table (rule_id, entity, severity, timestamp).
- Monthly DQ review & remediation tracking (manual register).
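A minimal sketch of one such DQ rule; the pandas frame and rule id are illustrative, and the real scripts wrote to SQL Server rather than printing:

```python
# Rule-based DQ check logging to a dq_issue structure (illustrative data).
from datetime import datetime
import pandas as pd

wells = pd.DataFrame({
    "well_id": ["W-001", "W-002", "W-003"],
    "oil_rate": [850.0, -12.0, None],   # negative / missing rates are errors
})

issues = []
for _, row in wells.iterrows():
    if pd.isna(row["oil_rate"]) or row["oil_rate"] < 0:
        issues.append({
            "rule_id": "PROD_RATE_001",
            "entity": row["well_id"],
            "severity": "HIGH",
            "timestamp": datetime.utcnow().isoformat(),
        })

dq_issue = pd.DataFrame(issues)   # appended to the dq_issue table in practice
print(dq_issue)
```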
- Lineage approximation: ETL job logs + timestamped folder patterns (no full automated lineage tooling yet).
12. Carbon / ESG (Early Stage)#
- Emission factor reference table (fuel gas, flaring, venting).
- Static carbon price scenarios (low / base / high) feeding OPEX line.
- No dynamic abatement / marginal cost optimization at that stage.
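A worked sketch of the static carbon cost line; the emission factors, volumes, and scenario prices below are illustrative, not the production reference table:

```python
# Static carbon cost: emission factors x activity volumes x scenario price.
EMISSION_FACTORS = {            # tCO2e per MMscf (illustrative values)
    "fuel_gas_mmscf": 54.6,
    "flaring_mmscf": 58.9,
}
CARBON_PRICE = {"low": 5.0, "base": 25.0, "high": 50.0}   # $/tCO2e

activity = {"fuel_gas_mmscf": 120.0, "flaring_mmscf": 15.0}   # monthly volumes

emissions = sum(EMISSION_FACTORS[k] * v for k, v in activity.items())
carbon_opex = {s: emissions * p for s, p in CARBON_PRICE.items()}
print(carbon_opex)   # feeds the OPEX line per scenario
```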
13. Real Options Implementation Constraints#
- Limited compute elasticity → overnight batch window; option runs weekly rather than continuously.
- Focused scope: delay sanction, phased expansion, early abandonment (highest value density).
- Longstaff–Schwartz regression on a reduced path sample (≈2k) to conserve resources (a minimal sketch follows the optimization list below).
14. Performance Optimizations#
- Pre-aggregate well groups to reduce simulation dimensionality.
- Spark broadcast joins for small dimension tables.
- Persist intermediate Monte Carlo artifacts (sampled price paths) for reuse.
- Convergence guardrail: escalate run count only when ENPV band instability exceeds tolerance.
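A minimal Longstaff–Schwartz sketch on a reduced 2k-path sample, valuing an early-abandonment decision as an American-style put on price; the GBM dynamics, strike, and rates are illustrative assumptions, not the production model:

```python
# Reduced-path Longstaff-Schwartz backward induction (illustrative parameters).
import numpy as np

rng = np.random.RandomState(7)
n_paths, n_steps, dt = 2000, 24, 1.0 / 12       # 2k paths, monthly over 2 years
r, sigma, s0, strike = 0.05, 0.35, 60.0, 55.0   # salvage-style "strike"

# Simulate risk-neutral GBM price paths.
z = rng.standard_normal((n_paths, n_steps))
log_paths = np.cumsum((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z, axis=1)
s = s0 * np.exp(log_paths)

disc = np.exp(-r * dt)
cash = np.maximum(strike - s[:, -1], 0.0)       # exercise value at maturity

# Backward induction: regress continuation value on in-the-money paths.
for t in range(n_steps - 2, -1, -1):
    cash *= disc                                 # discount one step back
    exercise = np.maximum(strike - s[:, t], 0.0)
    itm = exercise > 0
    if itm.sum() > 10:                           # skip sparse regressions
        coef = np.polyfit(s[itm, t], cash[itm], 2)   # quadratic basis
        continuation = np.polyval(coef, s[itm, t])
        cash[itm] = np.where(exercise[itm] > continuation,
                             exercise[itm], cash[itm])

option_value = disc * cash.mean()                # discount first step to t=0
print(round(option_value, 2))
```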
15. Monitoring & Logging#
- Scheduler logs (Oozie) + `run_log` status, duration, record counts.
- Simple email alerts on failure; optional ELK stack where a PoC was adopted, but not standardized.
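A sketch of the failure-alert pattern: poll `run_log` for failures and email a summary. sqlite3 again stands in for SQL Server, and the SMTP host and addresses are placeholders:

```python
# Poll run_log for failed runs and send a summary email (placeholder hosts).
import smtplib
import sqlite3
from email.mime.text import MIMEText

conn = sqlite3.connect("econ_runs.db")   # stand-in for SQL Server
failed = conn.execute(
    "SELECT run_id, status FROM run_log WHERE status = 'FAILED'"
).fetchall()

if failed:
    body = "\n".join("run %s: %s" % (rid, st) for rid, st in failed)
    msg = MIMEText(body)
    msg["Subject"] = "[econ] %d failed run(s)" % len(failed)
    msg["From"] = "etl-alerts@example.com"
    msg["To"] = "econ-team@example.com"
    with smtplib.SMTP("smtp.example.com") as smtp:
        smtp.sendmail(msg["From"], [msg["To"]], msg.as_string())
```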
16. Migration Path / Future Evolution#
| Evolution Step | Rationale | Outcome Target |
|---|---|---|
| Broader Airflow adoption | Transparency & modular DAG management | Replace opaque XML workflows |
| MLflow / experiment tracking | Trace model / parameter evolution | Reproducibility maturity |
| Object store full-path persistence | Richer post-hoc analytics | Scenario drill-down, re-simulation |
| Metadata / lineage catalog (DataHub etc.) | Governance & impact analysis | Faster change assessment |
| Containerization & elasticity | Reduce queue latency | On-demand scaling |
17. Architectural Summary Diagram#
```mermaid
flowchart TD
    Sources["Sources"]
    Raw["Raw Zone (HDFS)"]
    Curated["Curated (Hive)"]
    Marts["Marts (Impala/Presto)"]
    Reservoir["Reservoir Grids"]
    Engines["Economic / Option Engines (Py/R)"]
    Outputs["Simulation Outputs (Hive + SQL)"]
    BI["Reporting / BI"]

    Sources --> Raw --> Curated --> Marts
    Raw --> Reservoir
    Curated --> Engines
    Engines --> Outputs --> BI
    Reservoir --> Outputs
```
18. Key Constraints vs Modern Design#
| 2015–2018 Limitation | Modern Remedy |
|---|---|
| Limited elasticity (static cluster) | Serverless / autoscaling compute (Spark on k8s, Databricks) |
| Manual catalog & lineage | Active metadata graph (DataHub, OpenLineage) |
| Weekly options cycle | Event-driven incremental valuation |
| Percentiles-only storage | Columnar archival of full paths (Parquet / Delta) |
| Ad hoc approval workflow | Policy-driven parameter governance platform |
Historical architecture reference; complements the upstream evaluation modeling framework.