commit 951262afb39589f22a039aee227406dd962b4381
Author: Sok Ponlork
Date:   Thu Jan 29 14:30:23 2026 +0700

    first commit

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..5b71de2
--- /dev/null
+++ b/README.md
@@ -0,0 +1,204 @@
+# DSP Platform Docker Setup
+
+This repository now includes a Docker-based development environment that brings up:
+
+- **PHP + Apache** web server (with Rscript available for the automated analyses)
+- **MySQL 8.0** database seeded with the `db/niph_dsps.sql` dump on first run
+- **phpMyAdmin** for administering the database through the browser
+- **JupyterHub (per-user R-enabled JupyterLab)** for isolated notebook environments
+
+## Prerequisites
+
+- Docker Desktop (or Docker Engine + Docker Compose plugin)
+- ~2 GB of free disk space for the base images
+
+## Quick start
+
+```bash
+# From the project root
+docker-compose up --build
+```
+
+Once the stack is healthy you can reach the services at:
+
+| Service | URL | Notes |
+|-----------------|------------------------------|-------|
+| PHP application | http://localhost:8082 | Uses DB credentials from `docker-compose.yml` |
+| phpMyAdmin | http://localhost:8081 | Log in with `dsp_user` / `dsp_pass` (or MySQL root) |
+| JupyterHub | https://localhost | OAuth handshake redirects to your private notebook (published on port 443) |
+| MySQL | localhost:3307 (host access) | Database `niph_dsps`, user `dsp_user` / `dsp_pass` |
+
+The first `docker-compose up` will import `db/niph_dsps.sql` automatically. Subsequent runs keep the data volume (`mysql_data`).
+
+## Configuration
+
+Key environment variables are defined in `docker-compose.yml`. Adjust them if you need different credentials or ports. The PHP application now reads its database configuration from the following variables (with sensible defaults for non-Docker setups):
+
+- `DB_HOST`
+- `DB_PORT`
+- `DB_NAME`
+- `DB_USER`
+- `DB_PASS`
+
+`api/run_r_script.php` also honours `RSCRIPT_PATH` if you need to override the default location of the `Rscript` executable.
+
+When the portal is hosted on a different hostname (for example, an Ubuntu server on your LAN), set the following variables (either in your shell or in a `.env` file consumed by Docker Compose) to keep the embedded JupyterHub session aligned with browser security rules:
+
+- `JUPYTER_EXTERNAL_URL` – full base URL that the PHP app should point at (e.g. `https://niphdev.local`)
+- `JUPYTERHUB_PORT` – published port if you map JupyterHub to something other than `443` (legacy deployments can continue to set `JUPYTER_PORT`)
+- `DSP_APP_ORIGINS` – space-separated list of origins allowed to call notebook APIs (CORS)
+- `DSP_FRAME_ANCESTORS` – space-separated list of origins permitted to embed JupyterHub in an iframe
+
+### Platform roles at a glance
+
+The application enforces the following roles via `ist_tbl_users.isu_status` and the helper functions in `includes/auth.php`. Use this matrix to confirm which actions (upload, read, download, approve) each role can take before issuing credentials:
+
+| Role | Primary workspace | Upload / manage data sources | Approve access requests | Request / read / download datasets | Jupyter / R access |
+|------|-------------------|------------------------------|-------------------------|------------------------------------|--------------------|
+| **DAC Staff** | `admin/` area | ✅ Full oversight of every dataset, classification, and content entry. | ✅ Manage any permission, revoke and audit usage. | ✅ Can impersonate workflows when testing, but typically not used for research downloads. | ✅ Enable per-user via `isu_can_run_r`; also seeds OAuth credentials. |
+| **Data Owner** | `data_owner/` | ✅ Create and maintain their own catalogue entries and metadata. | ✅ Approve, reject, or revoke requests for the data they own. | ✅ Access their own approved files plus anything they have requested from others. | ✅ Optional; grant by setting `isu_can_run_r = 1`. Only approved files sync into their notebook. |
+| **Data Contributor** | `data_hybrid/` | ✅ Similar to owners, contributors can upload/publish datasets delegated to them. | ✅ Limited to the resources they registered or steward. | ✅ Can request access to other datasets and, once approved, read/download/analyze. | ✅ Optional per account; ideal for analysts who both publish and consume data. |
+| **Data User** | `data_user/` | ❌ Browse-only catalogue view. | ❌ Cannot approve requests. | ✅ May request access, then read/download once a Data Owner or DAC Staff approves the request. | ✅ Optional; if enabled, only their approved files appear in Jupyter. |
+
+> **Tip:** updating a user’s role or R access flag happens under **Admin → Manage Users**. Toggle the “Allow R/Jupyter” switch to control whether uploads are synchronized into their personal notebook volume.
+
+To wire DSP into JupyterHub via OAuth, also provide:
+
+- `DSP_OAUTH_CLIENT_ID` / `DSP_OAUTH_CLIENT_SECRET`
+- `DSP_OAUTH_AUTHORIZE_URL`, `DSP_OAUTH_TOKEN_URL`, `DSP_OAUTH_USERINFO_URL`
+- `JUPYTERHUB_OAUTH_CALLBACK`
+- `JUPYTERHUB_USER_PATH` and `JUPYTERHUB_USERNAME_TEMPLATE` if you need custom routing/usernames
+- `JUPYTERHUB_CULL_API_TOKEN` (optional) – set to enable the idle culler service
+
+Seed or update the OAuth client after setting these env vars:
+
+```bash
+docker-compose exec app php scripts/seed_jupyterhub_client.php
+```
+
+The JupyterHub deployment trusts requests and iframe parents from `localhost:8082`, `127.0.0.1:8082`, and `https://dsp.niph.org.kh` by default. To allow different origins (for example your own DSP deployment), set:
+
+- `DSP_APP_ORIGINS` – space-separated list of origins that should be accepted for CORS/websocket requests (e.g. `DSP_APP_ORIGINS="https://dsp.niph.org.kh"`).
+- `DSP_FRAME_ANCESTORS` – space-separated list of origins allowed to embed the notebook in an iframe (e.g. `DSP_FRAME_ANCESTORS="https://dsp.niph.org.kh"`).
+
+JupyterHub is published on host port `443` (configurable via the `JUPYTERHUB_PORT` environment variable in `docker-compose.yml`), so a deployment reachable at `https://dsp.niph.org.kh` works out of the box.
+
+## Project directories shared with containers
+
+| Host directory | Container (app) | Container (Jupyter) |
+|-----------------------|-------------------------|------------------------------------|
+| `.` (project root) | `/var/www/html` | – |
+| `r_scripts/` | `/var/www/html/r_scripts` | `/home/jovyan/work/r_scripts` |
+| `uploads/jupyter_workspace` | `/var/www/html/uploads/jupyter_workspace` | `/home/jovyan/work` (per-user mount inside spawned notebook) |
+
+Uploads remain writable from the PHP container. If you run into permission warnings on macOS/Linux,
+`chmod -R 777 uploads` (or a tighter group-based permission) on the host usually resolves it. The path is bind-mounted into the `dsp_app` container, so ensure permissions are adjusted on the host side.
+
+- Uploaded files are stored under `uploads/datasources/` with names like `datasource__.ext`. This keeps paths unique while preserving a readable hint of the original filename. The default PHP upload limit is set to `20M` (see `docker/custom.ini`).
+
+- The `logs/app.log` file (created via `config.php`) records upload activity. If you do not see `[DataSource]` entries after an upload, confirm the app container can reach MySQL (`docker exec dsp_app php -r 'require "config.php"; echo "connected";'`).
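The log check described above can be scripted. The following is a minimal sketch, assuming only that upload entries carry the literal `[DataSource]` tag mentioned above; the function name `count_datasource_entries` is hypothetical and not part of the repository, and the rest of the log line format is not assumed.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): count the upload entries
# tagged [DataSource] in the application log.
count_datasource_entries() {
  # Default path is the log file named in the README; pass another path to override.
  local log_file="${1:-logs/app.log}"

  if [ ! -r "$log_file" ]; then
    echo "log file not readable: $log_file" >&2
    return 1
  fi

  # -F matches the pattern as a fixed string, so the brackets are literal.
  # -c prints the number of matching lines. grep exits 1 when that number
  # is 0, which "|| true" turns back into a successful return.
  grep -cF '[DataSource]' "$log_file" || true
}
```

Sourcing this file and calling `count_datasource_entries` from the project root after an upload prints the number of tagged lines; a result of `0` would be the cue to run the MySQL connectivity check quoted above.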
+
+## Architecture Overview
+
+```mermaid
+graph LR
+  subgraph Client
+    U[Browser / API Consumer]
+  end
+
+  subgraph Docker Stack
+    A[PHP + Apache<br/>dsp_app]
+    B[(MySQL 8.0<br/>dsp_db)]
+    C[phpMyAdmin<br/>dsp_phpmyadmin]
+    D[Jupyter Notebook<br/>dsp_jupyter]
+    V1[(uploads/datasources)]
+    V2[(r_scripts)]
+  end
+
+  U -->|HTTPS/HTTP :8082| A
+  U -->|HTTPS/HTTP :8081| C
+  U -->|HTTPS :443| D
+  A <-->|SQL :3306| B
+  C -->|Admin SQL| B
+  A -.shared volume .-> V1
+  A -.shared volume .-> V2
+  D -.shared volume .-> V1
+  D -.shared volume .-> V2
+```
+
+*Traffic legend:* solid lines represent runtime traffic, dotted lines represent bind-mounted volumes that synchronize datasets and R scripts between containers.
+
+> Need the raw Mermaid for presentations? See `assets/diagrams/data_ecosystem.mmd`.
+
+## Data Model Snapshot
+
+```mermaid
+erDiagram
+  IST_TBL_PEOPLE ||--o{ IST_TBL_USERS : "fkisp_id_of"
+  IST_TBL_PEOPLE ||--o{ DSPS_TBL_DATASOURCE : "fkisp_id_of"
+  DSPS_TBL_TYPEDATASOURCE ||--o{ DSPS_TBL_DATASOURCE : "fkdspstds_id"
+  DSPS_TBL_DSPSCATEGORY ||--o{ DSPS_TBL_DATASOURCE : "fkdspscate_id"
+  DSPS_TBL_DATASOURCE ||--o{ DSPS_TBL_DATASOURCE_PERMISSION : "fkdspsds_id"
+  IST_TBL_PEOPLE ||--o{ DSPS_TBL_DATASOURCE_PERMISSION : "fkisp_id_of (requester)"
+  DSPS_TBL_DATASOURCE ||--o{ DSPS_TBL_DATASOURCE_USED : "fkdspsdsused_id"
+  IST_TBL_PEOPLE ||--o{ DSPS_TBL_DATASOURCE_USED : "fkisp_id_of (consumer)"
+```
+
+The diagram highlights how every dataset anchors to a person record, while permissions and usage logs capture cross-person interactions for auditing.
+
+## Analytics Catalog
+
+Analytics scripts live in `r_scripts/` and are exposed through `api/run_r_script.php`. Each script receives two CLI arguments: the absolute path to a CSV prepared by PHP and a JSON string of runtime parameters.
+
+| Script | Purpose | Required Parameters | Optional Parameters | Output |
+|--------|---------|---------------------|---------------------|--------|
+| `data_summary.R` | Smoke-test script that confirms connectivity between PHP and R, echoing the received file path and parameters. | _None_ | Any JSON payload is echoed back in `params_received`. | JSON with `message`, `data_file`, and the raw parameter string. |
+| `descriptive_stats.R` | Generates descriptive statistics for every numeric column (count, mean, median, SD, min, max, missing) and returns up to five preview rows. | _None_ (operates on all numeric columns). | `encoding` (default `UTF-8`), `guess_max` to control type inference. | JSON payload containing `numeric_columns` keyed by column name plus `sample_rows`. Missing values are encoded as `null`. |
+| `category_frequency.R` | Builds a frequency distribution for a categorical column. Useful for validating controlled vocabularies or spotting dominant categories. | `column` – name of the column to profile. | `top_n` (default `10`), `encoding` (default `UTF-8`), `include_missing` (`false` by default). | JSON with the analyzed column, configuration echo, and `frequencies` (value/count rows) sorted by frequency. |
+
+### Adding another R script
+
+1. Drop the script into `r_scripts/` and ensure it prints JSON via `jsonlite::toJSON(...)`.
+2. Append the filename and human-readable label to `$allowed_r_scripts` inside `api/run_r_script.php`.
+3. Document the new script in the table above so stakeholders understand its expected parameters and output contract.
+
+## Useful commands
+
+```bash
+# Stop and remove containers, keeping the database volume
+docker-compose down
+
+# Stop containers and remove the database volume (fresh start)
+docker-compose down -v
+
+# Tail logs from all services
+docker-compose logs -f
+```
+
+## Running Tests
+
+PHPUnit is configured via Composer:
+
+```bash
+# Install dependencies (first run)
+composer install
+
+# Execute the test suite
+composer test
+```
+
+If you prefer running inside the app container:
+
+```bash
+docker-compose exec app composer install
+docker-compose exec app composer test
+```
+
+## Troubleshooting
+
+- **MySQL already initialised**: remove the `mysql_data` named volume (`docker-compose down -v`) to force a clean import.
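+- **Rscript not found**: rebuild the image with `docker-compose build` so the PHP container has R installed. Set `RSCRIPT_PATH` in `docker-compose.yml` if R lives elsewhere.
+- **Port clashes**: adjust the published ports (`8082`, `8081`, `443`, `3307`) in `docker-compose.yml` to free ones on your machine.
+- **Need the OAuth tables?**: run `docker-compose exec -T db sh -c 'mysql -u root -p"$MYSQL_ROOT_PASSWORD" niph_dsps' < db/migrations/20241103_oauth_tables.sql`, then insert your JupyterHub client credentials. The `-T` flag disables TTY allocation so the redirected file reaches `mysql`; reading the password from `MYSQL_ROOT_PASSWORD` assumes the `db` service defines that variable, so substitute your own root password otherwise.
+
+Happy hacking!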