# DSP Platform Docker Setup

This repository now includes a Docker-based development environment that brings up:

- **PHP + Apache** web server (with Rscript available for the automated analyses)
- **MySQL 8.0** database seeded with the `db/niph_dsps.sql` dump on first run
- **phpMyAdmin** for administering the database through the browser
- **JupyterHub (per-user R-enabled JupyterLab)** for isolated notebook environments

## Prerequisites

- Docker Desktop (or Docker Engine + Docker Compose plugin)
- ~2 GB of free disk space for the base images

## Quick start

```bash
# From the project root
docker-compose up --build
```

Once the stack is healthy you can reach the services at:

| Service         | URL                          | Notes |
|-----------------|------------------------------|-------|
| PHP application | http://localhost:8082        | Uses DB credentials from `docker-compose.yml` |
| phpMyAdmin      | http://localhost:8081        | Login with `dsp_user` / `dsp_pass` (or MySQL root) |
| JupyterHub      | https://localhost            | OAuth handshake redirects to your private notebook (published on port 443) |
| MySQL           | localhost:3307 (host access) | Database `niph_dsps`, user `dsp_user` / `dsp_pass` |

The first `docker-compose up` will import `db/niph_dsps.sql` automatically. Subsequent runs keep the data volume (`mysql_data`).

## Configuration

Key environment variables are defined in `docker-compose.yml`. Adjust them if you need different credentials or ports. The PHP application now reads its database configuration from the following variables (with sensible defaults for non-Docker setups):

- `DB_HOST`
- `DB_PORT`
- `DB_NAME`
- `DB_USER`
- `DB_PASS`

`api/run_r_script.php` also honours `RSCRIPT_PATH` if you need to override the default location of the `Rscript` executable.
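To see which database settings a host-side tool would pick up, the `DB_*` variables above can be resolved with shell fallbacks. This is a minimal sketch, not the application's own logic: the function name `dsp_db_summary` is hypothetical, and the fallback values are taken from the host-access row of the service table, so they may differ from the defaults built into the PHP code.

```bash
#!/usr/bin/env bash
# Sketch: resolve the DB_* variables with fallbacks and print a summary.
# ASSUMPTION: the fallbacks mirror the host-access values from the service
# table; the PHP application's real built-in defaults may differ.

dsp_db_summary() {
  local host="${DB_HOST:-localhost}"
  local port="${DB_PORT:-3307}"
  local name="${DB_NAME:-niph_dsps}"
  local user="${DB_USER:-dsp_user}"
  # DB_PASS is deliberately left out of the printed summary.
  printf 'mysql://%s@%s:%s/%s\n' "$user" "$host" "$port" "$name"
}

dsp_db_summary
```

Setting a variable for a single call, for example `DB_HOST=db DB_PORT=3306 dsp_db_summary`, shows how an override changes the result without touching `docker-compose.yml`.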
When the portal is hosted on a different hostname (for example, an Ubuntu server on your LAN), set the following variables, either in your shell or in a `.env` file consumed by Docker Compose, to keep the embedded JupyterHub session aligned with browser security rules:

- `JUPYTER_EXTERNAL_URL` – full base URL that the PHP app should point at (e.g. `https://niphdev.local`)
- `JUPYTERHUB_PORT` – published port if you map JupyterHub to something other than `443` (legacy deployments can continue to set `JUPYTER_PORT`)
- `DSP_APP_ORIGINS` – space-separated list of origins allowed to call notebook APIs (CORS)
- `DSP_FRAME_ANCESTORS` – space-separated list of origins permitted to embed JupyterHub in an iframe

### Platform roles at a glance

The application enforces the following roles via `ist_tbl_users.isu_status` and the helper functions in `includes/auth.php`. Use this matrix to confirm which actions (upload, read, download, approve) each role can take before issuing credentials:

| Role | Primary workspace | Upload / manage data sources | Approve access requests | Request / read / download datasets | Jupyter / R access |
|------|-------------------|------------------------------|-------------------------|------------------------------------|--------------------|
| **DAC Staff** | `admin/` area | ✅ Full oversight of every dataset, classification, and content entry. | ✅ Manage any permission, revoke and audit usage. | ✅ Can impersonate workflows when testing, but typically not used for research downloads. | ✅ Enable per-user via `isu_can_run_r`; also seeds OAuth credentials. |
| **Data Owner** | `data_owner/` | ✅ Create and maintain their own catalogue entries and metadata. | ✅ Approve, reject, or revoke requests for the data they own. | ✅ Access their own approved files plus anything they have requested from others. | ✅ Optional; grant by setting `isu_can_run_r = 1`. Only approved files sync into their notebook. |
| **Data Contributor** | `data_hybrid/` | ✅ Similar to owners, contributors can upload/publish datasets delegated to them. | ✅ Limited to the resources they registered or steward. | ✅ Can request access to other datasets and, once approved, read/download/analyze. | ✅ Optional per account; ideal for analysts who both publish and consume data. |
| **Data User** | `data_user/` | ❌ Browse-only catalogue view. | ❌ Cannot approve requests. | ✅ May request access, then read/download once a Data Owner or DAC Staff approves the request. | ✅ Optional; if enabled, only their approved files appear in Jupyter. |

> **Tip:** updating a user’s role or R access flag happens under **Admin → Manage Users**. Toggle the “Allow R/Jupyter” switch to control whether uploads are synchronized into their personal notebook volume.

To wire DSP into JupyterHub via OAuth, also provide:

- `DSP_OAUTH_CLIENT_ID` / `DSP_OAUTH_CLIENT_SECRET`
- `DSP_OAUTH_AUTHORIZE_URL`, `DSP_OAUTH_TOKEN_URL`, `DSP_OAUTH_USERINFO_URL`
- `JUPYTERHUB_OAUTH_CALLBACK`
- `JUPYTERHUB_USER_PATH` and `JUPYTERHUB_USERNAME_TEMPLATE` if you need custom routing/usernames
- `JUPYTERHUB_CULL_API_TOKEN` (optional) – set to enable the idle culler service

Seed or update the OAuth client after setting these env vars:

```bash
docker-compose exec app php scripts/seed_jupyterhub_client.php
```

The JupyterHub deployment trusts requests and iframe parents from `localhost:8082`, `127.0.0.1:8082`, and `https://dsp.niph.org.kh` by default. To allow different origins (for example your own DSP deployment), set:

- `DSP_APP_ORIGINS` – space-separated list of origins that should be accepted for CORS/websocket requests (e.g. `DSP_APP_ORIGINS="https://dsp.niph.org.kh"`).
- `DSP_FRAME_ANCESTORS` – space-separated list of origins allowed to embed the notebook in an iframe (e.g. `DSP_FRAME_ANCESTORS="https://dsp.niph.org.kh"`).
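Because both variables take a space-separated list, a stray path or trailing slash in one entry is easy to miss. The sketch below checks a list before exporting it, assuming each entry is meant to be a bare origin (`scheme://host[:port]`) and contains no shell glob characters. The helper name `check_origins` and the example origins are illustrative, not part of the repository.

```bash
#!/usr/bin/env bash
# Sketch: sanity-check a space-separated origin list before exporting it.
# ASSUMPTION: every entry should be a bare origin (scheme://host[:port])
# with no path, no trailing slash, and no glob characters.

check_origins() {
  local origin
  for origin in $1; do          # unquoted on purpose: split on whitespace
    case "$origin" in
      http://*/*|https://*/*)   # anything after the host, even a lone "/"
        echo "not a bare origin: $origin" >&2
        return 1 ;;
      http://?*|https://?*)     # scheme plus a non-empty host[:port]
        ;;
      *)                        # missing or unsupported scheme
        echo "not a bare origin: $origin" >&2
        return 1 ;;
    esac
  done
}

origins="https://dsp.niph.org.kh http://localhost:8082"
if check_origins "$origins"; then
  export DSP_APP_ORIGINS="$origins"
  export DSP_FRAME_ANCESTORS="$origins"
fi
```

Here the same list is used for both variables. Keep them separate if the pages that embed the notebook differ from the origins that call its APIs.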
JupyterHub is published on host port `443` (configurable via the `JUPYTERHUB_PORT` environment variable in `docker-compose.yml`), so a deployment reachable at `https://dsp.niph.org.kh` works out of the box.

## Project directories shared with containers

| Host directory | Container (app) | Container (Jupyter) |
|----------------|-----------------|---------------------|
| `.` (project root) | `/var/www/html` | – |
| `r_scripts/` | `/var/www/html/r_scripts` | `/home/jovyan/work/r_scripts` |
| `uploads/jupyter_workspace` | `/var/www/html/uploads/jupyter_workspace` | `/home/jovyan/work` (per-user mount inside spawned notebook) |

Uploads remain writable from the PHP container. If you run into permission warnings on macOS/Linux, running `chmod -R 777 uploads` (or applying a tighter group-based permission) on the host usually resolves them. The path is bind-mounted into the `dsp_app` container, so permissions must be adjusted on the host side.

- Uploaded files are stored under `uploads/datasources/` with names like `datasource__.ext`. This keeps paths unique while preserving a readable hint of the original filename. The default PHP upload limit is set to `20M` (see `docker/custom.ini`).
- The `logs/app.log` file (created via `config.php`) records upload activity. If you do not see `[DataSource]` entries after an upload, confirm the app container can reach MySQL (`docker exec dsp_app php -r 'require "config.php"; echo "connected";'`).

## Architecture Overview

```mermaid
graph LR
    subgraph Client
        U[Browser / API Consumer]
    end

    subgraph Docker Stack
        A[PHP + Apache<br/>dsp_app]
        B[(MySQL 8.0<br/>dsp_db)]
        C[phpMyAdmin<br/>dsp_phpmyadmin]
        D[Jupyter Notebook<br/>dsp_jupyter]
        V1[(uploads/datasources)]
        V2[(r_scripts)]
    end

    U -->|HTTPS/HTTP :8082| A
    U -->|HTTPS/HTTP :8081| C
    U -->|HTTPS :443| D
    A <-->|SQL :3306| B
    C -->|Admin SQL| B
    A -.shared volume .-> V1
    A -.shared volume .-> V2
    D -.shared volume .-> V1
    D -.shared volume .-> V2
```

*Traffic legend:* solid lines represent runtime traffic; dotted lines represent bind-mounted volumes that synchronize datasets and R scripts between containers.

> Need the raw Mermaid for presentations? See `assets/diagrams/data_ecosystem.mmd`.

## Data Model Snapshot

```mermaid
erDiagram
    IST_TBL_PEOPLE ||--o{ IST_TBL_USERS : "fkisp_id_of"
    IST_TBL_PEOPLE ||--o{ DSPS_TBL_DATASOURCE : "fkisp_id_of"
    DSPS_TBL_TYPEDATASOURCE ||--o{ DSPS_TBL_DATASOURCE : "fkdspstds_id"
    DSPS_TBL_DSPSCATEGORY ||--o{ DSPS_TBL_DATASOURCE : "fkdspscate_id"
    DSPS_TBL_DATASOURCE ||--o{ DSPS_TBL_DATASOURCE_PERMISSION : "fkdspsds_id"
    IST_TBL_PEOPLE ||--o{ DSPS_TBL_DATASOURCE_PERMISSION : "fkisp_id_of (requester)"
    DSPS_TBL_DATASOURCE ||--o{ DSPS_TBL_DATASOURCE_USED : "fkdspsdsused_id"
    IST_TBL_PEOPLE ||--o{ DSPS_TBL_DATASOURCE_USED : "fkisp_id_of (consumer)"
```

The diagram highlights how every dataset anchors to a person record, while permissions and usage logs capture cross-person interactions for auditing.

## Analytics Catalog

Analytics scripts live in `r_scripts/` and are exposed through `api/run_r_script.php`. Each script receives two CLI arguments: the absolute path to a CSV prepared by PHP and a JSON string of runtime parameters.

| Script | Purpose | Required Parameters | Optional Parameters | Output |
|--------|---------|---------------------|---------------------|--------|
| `data_summary.R` | Smoke-test script that confirms connectivity between PHP and R, echoing the received file path and parameters. | _None_ | Any JSON payload is echoed back in `params_received`. | JSON with `message`, `data_file`, and the raw parameter string. |
| `descriptive_stats.R` | Generates descriptive statistics for every numeric column (count, mean, median, SD, min, max, missing) and returns up to five preview rows. | _None_ (operates on all numeric columns). | `encoding` (default `UTF-8`), `guess_max` to control type inference. | JSON payload containing `numeric_columns` keyed by column name plus `sample_rows`. Missing values are encoded as `null`. |
| `category_frequency.R` | Builds a frequency distribution for a categorical column. Useful for validating controlled vocabularies or spotting dominant categories. | `column` – name of the column to profile. | `top_n` (default `10`), `encoding` (default `UTF-8`), `include_missing` (`false` by default). | JSON with the analyzed column, configuration echo, and `frequencies` (value/count rows) sorted by frequency. |

### Adding another R script

1. Drop the script into `r_scripts/` and ensure it prints JSON via `jsonlite::toJSON(...)`.
2. Append the filename and human-readable label to `$allowed_r_scripts` inside `api/run_r_script.php`.
3. Document the new script in the table above so stakeholders understand its expected parameters and output contract.

## Useful commands

```bash
# Stop and remove containers, keeping the database volume
docker-compose down

# Stop containers and remove the database volume (fresh start)
docker-compose down -v

# Tail logs from all services
docker-compose logs -f
```

## Running Tests

PHPUnit is configured via Composer:

```bash
# Install dependencies (first run)
composer install

# Execute the test suite
composer test
```

If you prefer running inside the app container:

```bash
docker-compose exec app composer install
docker-compose exec app composer test
```

## Troubleshooting

- **MySQL already initialised**: remove the `mysql_data` named volume (`docker-compose down -v`) to force a clean import.
- **Rscript not found**: ensure the PHP container has R installed (rebuild the image with `docker-compose build`). Set `RSCRIPT_PATH` in `docker-compose.yml` if R lives elsewhere.
- **Port clashes**: change the published ports (`8082`, `8081`, `443`, `3307`) in `docker-compose.yml` to ones that are free on your machine.
- **Need the OAuth tables?**: run `docker-compose exec db mysql -u root -p niph_dsps < db/migrations/20241103_oauth_tables.sql`, then insert your JupyterHub client credentials.

Happy hacking!