first commit

This commit is contained in:
Sok Ponlork
2026-01-29 14:30:23 +07:00
commit 951262afb3

README.md Normal file

@@ -0,0 +1,204 @@
# DSP Platform Docker Setup
This repository now includes a Docker-based development environment that brings up:
- **PHP + Apache** web server (with Rscript available for the automated analyses)
- **MySQL 8.0** database seeded with the `db/niph_dsps.sql` dump on first run
- **phpMyAdmin** for administering the database through the browser
- **JupyterHub (per-user R-enabled JupyterLab)** for isolated notebook environments
## Prerequisites
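- Docker Desktop (or Docker Engine + Docker Compose). If you only have the Compose plugin, substitute `docker compose` for `docker-compose` in the commands below.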
- ~2 GB of free disk space for the base images
## Quick start
```bash
# From the project root
docker-compose up --build
```
Once the stack is healthy you can reach the services at:
| Service | URL | Notes |
|-----------------|------------------------------|-------|
| PHP application | http://localhost:8082 | Uses DB credentials from `docker-compose.yml` |
| phpMyAdmin | http://localhost:8081 | Login with `dsp_user` / `dsp_pass` (or MySQL root) |
| JupyterHub | https://localhost | OAuth handshake redirects to your private notebook (published on port 443) |
| MySQL | localhost:3307 (host access) | Database `niph_dsps`, user `dsp_user` / `dsp_pass` |
The first `docker-compose up` will import `db/niph_dsps.sql` automatically. Subsequent runs keep the data volume (`mysql_data`).
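
To confirm that the browser-facing services respond after startup, a small check along these lines may help. It is a minimal sketch: the function names are hypothetical, the URLs are taken from the table above, and `curl -k` is used on the assumption that the development JupyterHub endpoint presents a self-signed certificate.

```bash
# List the browser-facing endpoints from the table above, one per line.
dsp_endpoints() {
  printf '%s\n' \
    "http://localhost:8082" \
    "http://localhost:8081" \
    "https://localhost"
}

# Probe each endpoint and report whether it answered at all (any HTTP status counts).
# Assumes curl is installed; -k skips certificate verification (local development only).
check_dsp_endpoints() {
  dsp_endpoints | while read -r url; do
    if curl -ks -o /dev/null --max-time 5 "$url"; then
      echo "ok    $url"
    else
      echo "down  $url"
    fi
  done
}
```

Run `check_dsp_endpoints` once `docker-compose up` has settled. MySQL on `localhost:3307` is not covered because it does not speak HTTP.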
## Configuration
Key environment variables are defined in `docker-compose.yml`. Adjust them if you need different credentials or ports. The PHP application now reads its database configuration from the following variables (with sensible defaults for non-Docker setups):
- `DB_HOST`
- `DB_PORT`
- `DB_NAME`
- `DB_USER`
- `DB_PASS`
`api/run_r_script.php` also honours `RSCRIPT_PATH` if you need to override the default location of the `Rscript` executable.
When the portal is hosted on a different hostname (for example, an Ubuntu server on your LAN), set the following variables, either in your shell or in a `.env` file consumed by Docker Compose, so the embedded JupyterHub session stays aligned with browser security rules:
- `JUPYTER_EXTERNAL_URL`: full base URL that the PHP app should point at (e.g. `https://niphdev.local`)
- `JUPYTERHUB_PORT`: published port, if you map JupyterHub to something other than `443` (legacy deployments can continue to set `JUPYTER_PORT`)
- `DSP_APP_ORIGINS`: space-separated list of origins allowed to call notebook APIs (CORS)
- `DSP_FRAME_ANCESTORS`: space-separated list of origins permitted to embed JupyterHub in an iframe
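
As an illustration, a `.env` file for a LAN deployment might look like the following. The hostname reuses the `https://niphdev.local` example above. That the PHP app is still served over plain HTTP on port `8082` of that host is only an assumption, so adjust the origins to match your setup.

```bash
# .env (read by Docker Compose from the project root)
JUPYTER_EXTERNAL_URL="https://niphdev.local"
JUPYTERHUB_PORT="443"
# Assumed: the PHP app is reachable on port 8082 of the same host.
DSP_APP_ORIGINS="http://niphdev.local:8082 http://localhost:8082"
DSP_FRAME_ANCESTORS="http://niphdev.local:8082 http://localhost:8082"
```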
### Platform roles at a glance
The application enforces the following roles via `ist_tbl_users.isu_status` and the helper functions in `includes/auth.php`. Use this matrix to confirm which actions (upload, read, download, approve) each role can take before issuing credentials:
| Role | Primary workspace | Upload / manage data sources | Approve access requests | Request / read / download datasets | Jupyter / R access |
|------|-------------------|------------------------------|-------------------------|------------------------------------|--------------------|
| **DAC Staff** | `admin/` area | ✅ Full oversight of every dataset, classification, and content entry. | ✅ Manage any permission, revoke and audit usage. | ✅ Can impersonate workflows when testing, but typically not used for research downloads. | ✅ Enable per-user via `isu_can_run_r`; also seeds OAuth credentials. |
| **Data Owner** | `data_owner/` | ✅ Create and maintain their own catalogue entries and metadata. | ✅ Approve, reject, or revoke requests for the data they own. | ✅ Access their own approved files plus anything they have requested from others. | ✅ Optional; grant by setting `isu_can_run_r = 1`. Only approved files sync into their notebook. |
| **Data Contributor** | `data_hybrid/` | ✅ Similar to owners, contributors can upload/publish datasets delegated to them. | ✅ Limited to the resources they registered or steward. | ✅ Can request access to other datasets and, once approved, read/download/analyze. | ✅ Optional per account; ideal for analysts who both publish and consume data. |
| **Data User** | `data_user/` | ❌ Browse-only catalogue view. | ❌ Cannot approve requests. | ✅ May request access, then read/download once a Data Owner or DAC Staff approves the request. | ✅ Optional; if enabled, only their approved files appear in Jupyter. |
> **Tip:** a user's role and R access flag are updated under **Admin → Manage Users**. Toggle the “Allow R/Jupyter” switch to control whether uploads are synchronized into their personal notebook volume.
To wire DSP into JupyterHub via OAuth, also provide:
- `DSP_OAUTH_CLIENT_ID` / `DSP_OAUTH_CLIENT_SECRET`
- `DSP_OAUTH_AUTHORIZE_URL`, `DSP_OAUTH_TOKEN_URL`, `DSP_OAUTH_USERINFO_URL`
- `JUPYTERHUB_OAUTH_CALLBACK`
- `JUPYTERHUB_USER_PATH` and `JUPYTERHUB_USERNAME_TEMPLATE` if you need custom routing/usernames
- `JUPYTERHUB_CULL_API_TOKEN` (optional): set this to enable the idle culler service
Seed or update the OAuth client after setting these env vars:
```bash
docker-compose exec app php scripts/seed_jupyterhub_client.php
```
The JupyterHub deployment trusts requests and iframe parents from `localhost:8082`, `127.0.0.1:8082`, and `https://dsp.niph.org.kh` by default. To allow different origins (for example your own DSP deployment), set:
- `DSP_APP_ORIGINS`: space-separated list of origins that should be accepted for CORS/websocket requests (e.g. `DSP_APP_ORIGINS="https://dsp.niph.org.kh"`).
- `DSP_FRAME_ANCESTORS`: space-separated list of origins allowed to embed the notebook in an iframe (e.g. `DSP_FRAME_ANCESTORS="https://dsp.niph.org.kh"`).
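
Browsers normally enforce iframe embedding through a `Content-Security-Policy: frame-ancestors` response header. How the hub configuration turns `DSP_FRAME_ANCESTORS` into that header is not shown here, so the following is only a sketch of the likely mapping. The function name, the inclusion of `'self'`, and the `http://` scheme on the two local defaults are assumptions.

```bash
# Build a frame-ancestors directive from a space-separated origin list.
# Hypothetical helper: the real hub configuration may assemble the header differently.
frame_ancestors_policy() (
  set -f                        # disable globbing so origins are taken literally
  policy="frame-ancestors 'self'"
  for origin in $1; do          # unquoted on purpose: split the list on whitespace
    policy="$policy $origin"
  done
  printf '%s\n' "$policy"
)

# Falls back to the default origins listed above when the variable is unset.
frame_ancestors_policy "${DSP_FRAME_ANCESTORS:-http://localhost:8082 http://127.0.0.1:8082 https://dsp.niph.org.kh}"
```

Printing the directive this way makes it easy to compare against the header the hub actually sends, for example with `curl -kI https://localhost`.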
JupyterHub is published on host port `443` (configurable via the `JUPYTERHUB_PORT` environment variable in `docker-compose.yml`), so a deployment reachable at `https://dsp.niph.org.kh` works out of the box.
## Project directories shared with containers
| Host directory | Container (app) | Container (Jupyter) |
|-----------------------|-------------------------|------------------------------------|
| `.` (project root) | `/var/www/html` | |
| `r_scripts/` | `/var/www/html/r_scripts` | `/home/jovyan/work/r_scripts` |
| `uploads/jupyter_workspace` | `/var/www/html/uploads/jupyter_workspace` | `/home/jovyan/work` (per-user mount inside spawned notebook) |
Uploads remain writable from the PHP container. If you run into permission warnings on macOS/Linux, running `chmod -R 777 uploads` (or applying a tighter group-based permission) on the host usually resolves them. The path is bind-mounted into the `dsp_app` container, so permissions must be adjusted on the host side.
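
As a tighter alternative to `777`, group-based permissions can be sketched as follows. It assumes the web server inside the container runs under a group ID you can also assign on the host (Debian-based PHP/Apache images typically use `www-data`, GID 33). The function name is hypothetical, and the group is passed in rather than hard-coded.

```bash
# Give one group read/write access to the uploads tree instead of opening it to everyone.
# Usage: share_uploads_with_group <dir> <group-or-gid>
share_uploads_with_group() {
  dir=$1
  group=$2
  chgrp -R "$group" "$dir" &&
    chmod -R u+rwX,g+rwX,o-rwx "$dir" &&
    # setgid on directories so newly created files inherit the group
    find "$dir" -type d -exec chmod g+s {} +
}
```

Call it as `share_uploads_with_group uploads 33` from a shell with sufficient privileges. Bind mounts on Linux match numeric IDs rather than names, so use the GID the container actually runs with.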
- Uploaded files are stored under `uploads/datasources/` with names like `datasource_<unique>_<original-stem>.ext`. This keeps paths unique while preserving a readable hint of the original filename. The default PHP upload limit is set to `20M` (see `docker/custom.ini`).
- The `logs/app.log` file (created via `config.php`) records upload activity. If you do not see `[DataSource]` entries after an upload, confirm the app container can reach MySQL (`docker exec dsp_app php -r 'require "config.php"; echo "connected";'`).
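
A quick way to look for those entries from the host is sketched below. The `[DataSource]` tag and the `logs/app.log` path come from the text above; the function name is hypothetical.

```bash
# Count log lines carrying the [DataSource] tag; prints 0 when the file is missing.
count_datasource_entries() {
  log_file=${1:-logs/app.log}
  if [ -r "$log_file" ]; then
    grep -cF '[DataSource]' "$log_file" || true   # grep exits 1 when the count is 0
  else
    echo 0
  fi
}
```

Comparing the count before and after an upload shows whether the application logged it.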
## Architecture Overview
```mermaid
graph LR
subgraph Client
U[Browser / API Consumer]
end
subgraph Docker Stack
A[PHP + Apache<br/>dsp_app]
B[(MySQL 8.0<br/>dsp_db)]
C[phpMyAdmin<br/>dsp_phpmyadmin]
D[Jupyter Notebook<br/>dsp_jupyter]
V1[(uploads/datasources)]
V2[(r_scripts)]
end
U -->|HTTPS/HTTP :8082| A
U -->|HTTPS/HTTP :8081| C
U -->|HTTPS :443| D
A <-->|SQL :3306| B
C -->|Admin SQL| B
A -.shared volume .-> V1
A -.shared volume .-> V2
D -.shared volume .-> V1
D -.shared volume .-> V2
```
*Traffic legend:* solid lines represent runtime traffic, dotted lines represent bind-mounted volumes that synchronize datasets and R scripts between containers.
> Need the raw Mermaid for presentations? See `assets/diagrams/data_ecosystem.mmd`.
## Data Model Snapshot
```mermaid
erDiagram
IST_TBL_PEOPLE ||--o{ IST_TBL_USERS : "fkisp_id_of"
IST_TBL_PEOPLE ||--o{ DSPS_TBL_DATASOURCE : "fkisp_id_of"
DSPS_TBL_TYPEDATASOURCE ||--o{ DSPS_TBL_DATASOURCE : "fkdspstds_id"
DSPS_TBL_DSPSCATEGORY ||--o{ DSPS_TBL_DATASOURCE : "fkdspscate_id"
DSPS_TBL_DATASOURCE ||--o{ DSPS_TBL_DATASOURCE_PERMISSION : "fkdspsds_id"
IST_TBL_PEOPLE ||--o{ DSPS_TBL_DATASOURCE_PERMISSION : "fkisp_id_of (requester)"
DSPS_TBL_DATASOURCE ||--o{ DSPS_TBL_DATASOURCE_USED : "fkdspsdsused_id"
IST_TBL_PEOPLE ||--o{ DSPS_TBL_DATASOURCE_USED : "fkisp_id_of (consumer)"
```
The diagram highlights how every dataset anchors to a person record, while permissions and usage logs capture cross-person interactions for auditing.
## Analytics Catalog
Analytics scripts live in `r_scripts/` and are exposed through `api/run_r_script.php`. Each script receives two CLI arguments: the absolute path to a CSV prepared by PHP and a JSON string of runtime parameters.
| Script | Purpose | Required Parameters | Optional Parameters | Output |
|--------|---------|---------------------|---------------------|--------|
| `data_summary.R` | Smoke-test script that confirms connectivity between PHP and R, echoing the received file path and parameters. | _None_ | Any JSON payload is echoed back in `params_received`. | JSON with `message`, `data_file`, and the raw parameter string. |
| `descriptive_stats.R` | Generates descriptive statistics for every numeric column (count, mean, median, SD, min, max, missing) and returns up to five preview rows. | _None_ (operates on all numeric columns). | `encoding` (default `UTF-8`), `guess_max` to control type inference. | JSON payload containing `numeric_columns` keyed by column name plus `sample_rows`. Missing values are encoded as `null`. |
| `category_frequency.R` | Builds a frequency distribution for a categorical column. Useful for validating controlled vocabularies or spotting dominant categories. | `column`: name of the column to profile. | `top_n` (default `10`), `encoding` (default `UTF-8`), `include_missing` (`false` by default). | JSON with the analyzed column, configuration echo, and `frequencies` (value/count rows) sorted by frequency. |
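
To exercise a script outside PHP, the two-argument contract described above can be reproduced from a shell. This is a sketch under assumptions: the wrapper name, the CSV path, and the `province` column are hypothetical, and the exact command line that `api/run_r_script.php` builds may differ.

```bash
# Invoke an analytics script the way the contract describes:
#   argument 1 = absolute path to the CSV, argument 2 = JSON parameter string.
# RSCRIPT_PATH overrides the interpreter, mirroring the variable the API honours.
run_r_script() {
  script=$1
  csv_path=$2
  params_json=$3
  [ -n "$params_json" ] || params_json='{}'   # default to an empty JSON object
  "${RSCRIPT_PATH:-Rscript}" "r_scripts/$script" "$csv_path" "$params_json"
}

# Example call (hypothetical file and column), run from the project root:
# run_r_script category_frequency.R /tmp/cases.csv '{"column":"province","top_n":5}'
```

Single-quoting the JSON keeps the shell from interpreting the double quotes inside it.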
### Adding another R script
1. Drop the script into `r_scripts/` and ensure it prints JSON via `jsonlite::toJSON(...)`.
2. Append the filename and human-readable label to `$allowed_r_scripts` inside `api/run_r_script.php`.
3. Document the new script in the table above so stakeholders understand its expected parameters and output contract.
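
Steps 1 and 2 lend themselves to a quick pre-flight check before committing. The helper below is hypothetical and deliberately crude: it only searches for the text `toJSON` in the script and for the filename in `api/run_r_script.php`, so it can confirm that something is present but not that the registration is syntactically correct.

```bash
# Check that a new R script exists, mentions toJSON, and appears in the API file.
# Usage: check_r_script_registered <filename.R> [project-root]
check_r_script_registered() {
  name=$1
  root=${2:-.}
  script="$root/r_scripts/$name"
  api="$root/api/run_r_script.php"
  [ -f "$script" ] || { echo "missing: $script"; return 1; }
  grep -qF 'toJSON' "$script" || { echo "no toJSON call found in $name"; return 1; }
  grep -qF "$name" "$api" 2>/dev/null || { echo "$name not listed in $api"; return 1; }
  echo "ok: $name"
}
```

For example, `check_r_script_registered category_frequency.R` run from the project root reports the first problem it finds, or `ok` if both checks pass.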
## Useful commands
```bash
# Stop and remove containers, keeping the database volume
docker-compose down
# Stop containers and remove the database volume (fresh start)
docker-compose down -v
# Tail logs from all services
docker-compose logs -f
```
## Running Tests
PHPUnit is configured via Composer:
```bash
# Install dependencies (first run)
composer install
# Execute the test suite
composer test
```
If you prefer running inside the app container:
```bash
docker-compose exec app composer install
docker-compose exec app composer test
```
## Troubleshooting
- **MySQL already initialised**: remove the `mysql_data` named volume (`docker-compose down -v`) to force a clean import.
- **Rscript not found**: ensure the PHP container has R installed by re-running `docker-compose build`. Set `RSCRIPT_PATH` in `docker-compose.yml` if R lives elsewhere.
- **Port clashes**: adjust the published ports (`8082`, `8081`, `443`, `3307`) in `docker-compose.yml` to free ones on your machine.
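- **Need the OAuth tables?**: run `docker-compose exec -T db mysql -u root -p"$MYSQL_ROOT_PASSWORD" niph_dsps < db/migrations/20241103_oauth_tables.sql`, then insert your JupyterHub client credentials. The `-T` flag disables TTY allocation so the redirected file reaches `mysql`. Because an interactive password prompt does not work reliably without a TTY, the command expects the root password in a `MYSQL_ROOT_PASSWORD` variable exported in your host shell.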
Happy hacking!