ponlork_1st/README.md
Sok Ponlork 951262afb3 first commit
2026-01-29 14:30:23 +07:00

# DSP Platform Docker Setup
This repository now includes a Docker-based development environment that brings up:
- **PHP + Apache** web server (with Rscript available for the automated analyses)
- **MySQL 8.0** database seeded with the `db/niph_dsps.sql` dump on first run
- **phpMyAdmin** for administering the database through the browser
- **JupyterHub (per-user R-enabled JupyterLab)** for isolated notebook environments
## Prerequisites
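- Docker Desktop (or Docker Engine + Docker Compose plugin). The commands in this README use the standalone `docker-compose` binary; with the Compose plugin, run `docker compose` instead.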
- ~2 GB of free disk space for the base images
## Quick start
```bash
# From the project root
docker-compose up --build
```
Once the stack is healthy you can reach the services at:
| Service | URL | Notes |
|-----------------|------------------------------|-------|
| PHP application | http://localhost:8082 | Uses DB credentials from `docker-compose.yml` |
| phpMyAdmin | http://localhost:8081 | Login with `dsp_user` / `dsp_pass` (or MySQL root) |
| JupyterHub | https://localhost | OAuth handshake redirects to your private notebook (published on port 443) |
| MySQL | localhost:3307 (host access) | Database `niph_dsps`, user `dsp_user` / `dsp_pass` |
The first `docker-compose up` will import `db/niph_dsps.sql` automatically. Subsequent runs keep the data volume (`mysql_data`).
## Configuration
Key environment variables are defined in `docker-compose.yml`. Adjust them if you need different credentials or ports. The PHP application now reads its database configuration from the following variables (with sensible defaults for non-Docker setups):
- `DB_HOST`
- `DB_PORT`
- `DB_NAME`
- `DB_USER`
- `DB_PASS`
`api/run_r_script.php` also honours `RSCRIPT_PATH` if you need to override the default location of the `Rscript` executable.
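
To check the five `DB_*` variables from the host, the sketch below assembles `mysql` client arguments from the same names. The fallback values mirror the host-access row of the service table (`127.0.0.1:3307`, `niph_dsps`, `dsp_user` / `dsp_pass`). They are assumptions for illustration and may differ from the defaults built into the PHP application. `dsp_mysql_args` is a hypothetical helper name.

```bash
# Hypothetical helper: print mysql client arguments built from the DB_* variables,
# one argument per line. Fallbacks follow the host-access values in the service
# table above (assumed; they are not read from the application itself).
dsp_mysql_args() {
  printf '%s\n' \
    "--host=${DB_HOST:-127.0.0.1}" \
    "--port=${DB_PORT:-3307}" \
    "--user=${DB_USER:-dsp_user}" \
    "--password=${DB_PASS:-dsp_pass}" \
    "${DB_NAME:-niph_dsps}"
}

# Example (needs a mysql client on the host and a running stack):
#   mysql $(dsp_mysql_args) -e 'SELECT 1'
```

The fallback host is `127.0.0.1` rather than `localhost` because the `mysql` client treats `localhost` as a Unix socket connection on Linux and macOS. The unquoted `$(...)` in the example relies on word splitting, so it only suits values without spaces.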
When the portal is hosted on a different hostname (for example, an Ubuntu server on your LAN), set the following variables, either in your shell or in a `.env` file consumed by Docker Compose, to keep the embedded JupyterHub session aligned with browser security rules:
- `JUPYTER_EXTERNAL_URL`: full base URL that the PHP app should point at (e.g. `https://niphdev.local`)
- `JUPYTERHUB_PORT`: published port, if you map JupyterHub to something other than `443` (legacy deployments can continue to set `JUPYTER_PORT`)
- `DSP_APP_ORIGINS`: space-separated list of origins allowed to call notebook APIs (CORS)
- `DSP_FRAME_ANCESTORS`: space-separated list of origins permitted to embed JupyterHub in an iframe
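
As a rough illustration of how these settings fit together, the sketch below prints `.env` lines for a portal served from one origin and a JupyterHub reachable at another URL. `dsp_env_lines` is a hypothetical helper, and the portal address in the example (`http://niphdev.local:8082`) is an assumed value, not one taken from this repository.

```bash
# Hypothetical helper: print .env lines for a given portal origin and hub URL.
# Argument 1: origin of the PHP portal (assumed example: http://niphdev.local:8082)
# Argument 2: base URL of JupyterHub   (example from above: https://niphdev.local)
dsp_env_lines() {
  portal_origin="$1"
  hub_url="$2"
  printf 'JUPYTER_EXTERNAL_URL=%s\n' "$hub_url"
  # The portal is the page that calls the notebook APIs and embeds the iframe,
  # so the same origin is used for both lists in this sketch.
  printf 'DSP_APP_ORIGINS="%s"\n' "$portal_origin"
  printf 'DSP_FRAME_ANCESTORS="%s"\n' "$portal_origin"
}

# Example:
#   dsp_env_lines http://niphdev.local:8082 https://niphdev.local >> .env
```

`JUPYTERHUB_PORT` is left out here on the assumption that the hub stays on port `443`.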
### Platform roles at a glance
The application enforces the following roles via `ist_tbl_users.isu_status` and the helper functions in `includes/auth.php`. Use this matrix to confirm which actions (upload, read, download, approve) each role can take before issuing credentials:
| Role | Primary workspace | Upload / manage data sources | Approve access requests | Request / read / download datasets | Jupyter / R access |
|------|-------------------|------------------------------|-------------------------|------------------------------------|--------------------|
| **DAC Staff** | `admin/` area | ✅ Full oversight of every dataset, classification, and content entry. | ✅ Manage any permission, revoke and audit usage. | ✅ Can impersonate workflows when testing, but typically not used for research downloads. | ✅ Enable per-user via `isu_can_run_r`; also seeds OAuth credentials. |
| **Data Owner** | `data_owner/` | ✅ Create and maintain their own catalogue entries and metadata. | ✅ Approve, reject, or revoke requests for the data they own. | ✅ Access their own approved files plus anything they have requested from others. | ✅ Optional; grant by setting `isu_can_run_r = 1`. Only approved files sync into their notebook. |
| **Data Contributor** | `data_hybrid/` | ✅ Similar to owners, contributors can upload/publish datasets delegated to them. | ✅ Limited to the resources they registered or steward. | ✅ Can request access to other datasets and, once approved, read/download/analyze. | ✅ Optional per account; ideal for analysts who both publish and consume data. |
| **Data User** | `data_user/` | ❌ Browse-only catalogue view. | ❌ Cannot approve requests. | ✅ May request access, then read/download once a Data Owner or DAC Staff approves the request. | ✅ Optional; if enabled, only their approved files appear in Jupyter. |
> **Tip:** updating a user's role or R access flag happens under **Admin → Manage Users**. Toggle the “Allow R/Jupyter” switch to control whether uploads are synchronized into their personal notebook volume.
To wire DSP into JupyterHub via OAuth, also provide:
- `DSP_OAUTH_CLIENT_ID` / `DSP_OAUTH_CLIENT_SECRET`
- `DSP_OAUTH_AUTHORIZE_URL`, `DSP_OAUTH_TOKEN_URL`, `DSP_OAUTH_USERINFO_URL`
- `JUPYTERHUB_OAUTH_CALLBACK`
- `JUPYTERHUB_USER_PATH` and `JUPYTERHUB_USERNAME_TEMPLATE` if you need custom routing/usernames
- `JUPYTERHUB_CULL_API_TOKEN` (optional): set this to enable the idle culler service
Seed or update the OAuth client after setting these env vars:
```bash
docker-compose exec app php scripts/seed_jupyterhub_client.php
```
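
Before running the seed command, a small preflight check can catch variables that were never set. The sketch below treats the variables listed above, apart from the optional culler token and the routing overrides, as required. That split is an assumption based on the wording of the list, and `missing_oauth_vars` is a hypothetical helper name. It inspects the current shell, so export the values (or source your `.env`) first.

```bash
# Hypothetical preflight check: print the name of every OAuth variable from the
# list above that is unset or empty in the current shell.
missing_oauth_vars() {
  for name in DSP_OAUTH_CLIENT_ID DSP_OAUTH_CLIENT_SECRET \
              DSP_OAUTH_AUTHORIZE_URL DSP_OAUTH_TOKEN_URL DSP_OAUTH_USERINFO_URL \
              JUPYTERHUB_OAUTH_CALLBACK; do
    # Indirect lookup that also works in plain POSIX sh.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Example:
#   missing="$(missing_oauth_vars)"
#   [ -z "$missing" ] || echo "Set these first: $missing"
```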
The JupyterHub deployment trusts requests and iframe parents from `localhost:8082`, `127.0.0.1:8082`, and `https://dsp.niph.org.kh` by default. To allow different origins (for example your own DSP deployment), set:
- `DSP_APP_ORIGINS`: space-separated list of origins that should be accepted for CORS/websocket requests (e.g. `DSP_APP_ORIGINS="https://dsp.niph.org.kh"`).
- `DSP_FRAME_ANCESTORS`: space-separated list of origins allowed to embed the notebook in an iframe (e.g. `DSP_FRAME_ANCESTORS="https://dsp.niph.org.kh"`).
JupyterHub is published on host port `443` (configurable via the `JUPYTERHUB_PORT` environment variable in `docker-compose.yml`), so a deployment reachable at `https://dsp.niph.org.kh` works out of the box.
## Project directories shared with containers
| Host directory | Container (app) | Container (Jupyter) |
|-----------------------|-------------------------|------------------------------------|
| `.` (project root) | `/var/www/html` | |
| `r_scripts/` | `/var/www/html/r_scripts` | `/home/jovyan/work/r_scripts` |
| `uploads/jupyter_workspace` | `/var/www/html/uploads/jupyter_workspace` | `/home/jovyan/work` (per-user mount inside spawned notebook) |
Uploads remain writable from the PHP container. If you run into permission warnings on macOS/Linux, running `chmod -R 777 uploads` (or applying a tighter group-based permission) on the host usually resolves them. The path is bind-mounted into the `dsp_app` container, so permissions must be adjusted on the host side.
- Uploaded files are stored under `uploads/datasources/` with names like `datasource_<unique>_<original-stem>.ext`. This keeps paths unique while preserving a readable hint of the original filename. The default PHP upload limit is set to `20M` (see `docker/custom.ini`).
- The `logs/app.log` file (created via `config.php`) records upload activity. If you do not see `[DataSource]` entries after an upload, confirm the app container can reach MySQL (`docker exec dsp_app php -r 'require "config.php"; echo "connected";'`).
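
To make the stored-name pattern concrete, here is a minimal sketch that builds `datasource_<unique>_<original-stem>.ext` from a token and an original filename. How the application generates the unique token, and whether it sanitizes the stem, is not described in this README, so the token is simply passed in. `datasource_name` is a hypothetical helper and assumes the original filename has an extension.

```bash
# Hypothetical helper: build a stored name following the documented pattern.
# Argument 1: unique token (supplied by the caller; the app's generator is unknown)
# Argument 2: original filename or path (assumed to contain an extension)
datasource_name() {
  unique="$1"
  original="$(basename "$2")"
  stem="${original%.*}"    # everything before the last dot
  ext="${original##*.}"    # everything after the last dot
  printf 'datasource_%s_%s.%s\n' "$unique" "$stem" "$ext"
}

# datasource_name 65f1c2 survey_2024.csv   prints datasource_65f1c2_survey_2024.csv
```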
## Architecture Overview
```mermaid
graph LR
subgraph Client
U[Browser / API Consumer]
end
subgraph Docker Stack
A[PHP + Apache<br/>dsp_app]
B[(MySQL 8.0<br/>dsp_db)]
C[phpMyAdmin<br/>dsp_phpmyadmin]
D[Jupyter Notebook<br/>dsp_jupyter]
V1[(uploads/datasources)]
V2[(r_scripts)]
end
U -->|HTTPS/HTTP :8082| A
U -->|HTTPS/HTTP :8081| C
U -->|HTTPS :443| D
A <-->|SQL :3306| B
C -->|Admin SQL| B
A -.shared volume .-> V1
A -.shared volume .-> V2
D -.shared volume .-> V1
D -.shared volume .-> V2
```
*Traffic legend:* solid lines represent runtime traffic, dotted lines represent bind-mounted volumes that synchronize datasets and R scripts between containers.
> Need the raw Mermaid for presentations? See `assets/diagrams/data_ecosystem.mmd`.
## Data Model Snapshot
```mermaid
erDiagram
IST_TBL_PEOPLE ||--o{ IST_TBL_USERS : "fkisp_id_of"
IST_TBL_PEOPLE ||--o{ DSPS_TBL_DATASOURCE : "fkisp_id_of"
DSPS_TBL_TYPEDATASOURCE ||--o{ DSPS_TBL_DATASOURCE : "fkdspstds_id"
DSPS_TBL_DSPSCATEGORY ||--o{ DSPS_TBL_DATASOURCE : "fkdspscate_id"
DSPS_TBL_DATASOURCE ||--o{ DSPS_TBL_DATASOURCE_PERMISSION : "fkdspsds_id"
IST_TBL_PEOPLE ||--o{ DSPS_TBL_DATASOURCE_PERMISSION : "fkisp_id_of (requester)"
DSPS_TBL_DATASOURCE ||--o{ DSPS_TBL_DATASOURCE_USED : "fkdspsdsused_id"
IST_TBL_PEOPLE ||--o{ DSPS_TBL_DATASOURCE_USED : "fkisp_id_of (consumer)"
```
The diagram highlights how every dataset anchors to a person record, while permissions and usage logs capture cross-person interactions for auditing.
## Analytics Catalog
Analytics scripts live in `r_scripts/` and are exposed through `api/run_r_script.php`. Each script receives two CLI arguments: the absolute path to a CSV prepared by PHP and a JSON string of runtime parameters.
| Script | Purpose | Required Parameters | Optional Parameters | Output |
|--------|---------|---------------------|---------------------|--------|
| `data_summary.R` | Smoke-test script that confirms connectivity between PHP and R, echoing the received file path and parameters. | _None_ | Any JSON payload is echoed back in `params_received`. | JSON with `message`, `data_file`, and the raw parameter string. |
| `descriptive_stats.R` | Generates descriptive statistics for every numeric column (count, mean, median, SD, min, max, missing) and returns up to five preview rows. | _None_ (operates on all numeric columns). | `encoding` (default `UTF-8`), `guess_max` to control type inference. | JSON payload containing `numeric_columns` keyed by column name plus `sample_rows`. Missing values are encoded as `null`. |
| `category_frequency.R` | Builds a frequency distribution for a categorical column. Useful for validating controlled vocabularies or spotting dominant categories. | `column`: name of the column to profile. | `top_n` (default `10`), `encoding` (default `UTF-8`), `include_missing` (`false` by default). | JSON with the analyzed column, configuration echo, and `frequencies` (value/count rows) sorted by frequency. |
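
The two-argument contract can be exercised by hand when debugging a script outside PHP. The sketch below prints the argument vector for one run, one item per line, and honours `RSCRIPT_PATH` in the same spirit as `api/run_r_script.php`. `r_script_argv` is a hypothetical helper, and the CSV path and column name in the example are made up.

```bash
# Hypothetical helper: print the argument vector for one analytics run.
# Argument 1: script filename inside r_scripts/
# Argument 2: absolute path to the CSV
# Argument 3: JSON string of runtime parameters
r_script_argv() {
  script="$1"
  csv_path="$2"
  params_json="$3"
  printf '%s\n' "${RSCRIPT_PATH:-Rscript}" "r_scripts/$script" "$csv_path" "$params_json"
}

# Equivalent manual run from the project root (illustrative values):
#   Rscript r_scripts/category_frequency.R /tmp/sample.csv '{"column":"province","top_n":5}'
```

Single-quoting the JSON keeps the shell from interpreting the double quotes inside it.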
### Adding another R script
1. Drop the script into `r_scripts/` and ensure it prints JSON via `jsonlite::toJSON(...)`.
2. Append the filename and human-readable label to `$allowed_r_scripts` inside `api/run_r_script.php`.
3. Document the new script in the table above so stakeholders understand its expected parameters and output contract.
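
The first two steps can be sanity-checked with a small script. The sketch below confirms that a file exists under `r_scripts/` and that its name appears somewhere in `api/run_r_script.php`. It is a plain text search, so it cannot prove the entry actually sits inside `$allowed_r_scripts`. `r_script_registered` is a hypothetical helper name.

```bash
# Hypothetical check for a newly added analytics script.
# Argument 1: project root
# Argument 2: script filename (e.g. my_analysis.R, an illustrative name)
r_script_registered() {
  root="$1"
  name="$2"
  [ -f "$root/r_scripts/$name" ] || { echo "missing file"; return 1; }
  # Fixed-string search; only shows the name is mentioned in the PHP file.
  grep -qF "$name" "$root/api/run_r_script.php" || { echo "not in allow-list"; return 1; }
  echo "ok"
}

# Example:
#   r_script_registered . my_analysis.R
```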
## Useful commands
```bash
# Stop and remove containers, keeping the database volume
docker-compose down
# Stop containers and remove the database volume (fresh start)
docker-compose down -v
# Tail logs from all services
docker-compose logs -f
```
## Running Tests
PHPUnit is configured via Composer:
```bash
# Install dependencies (first run)
composer install
# Execute the test suite
composer test
```
If you prefer running inside the app container:
```bash
docker-compose exec app composer install
docker-compose exec app composer test
```
## Troubleshooting
- **MySQL already initialised**: remove the `mysql_data` named volume (`docker-compose down -v`) to force a clean import.
- **Rscript not found**: ensure the PHP container has R installed (`docker-compose build` again). Set `RSCRIPT_PATH` in `docker-compose.yml` if R lives elsewhere.
- **Port clashes**: adjust the published ports (`8082`, `8081`, `443`, `3307`) in `docker-compose.yml` to free ones on your machine.
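- **Need the OAuth tables?**: run `docker-compose exec -T db mysql -u root -pYOUR_ROOT_PASSWORD niph_dsps < db/migrations/20241103_oauth_tables.sql`, replacing `YOUR_ROOT_PASSWORD` with the MySQL root password, then insert your JupyterHub client credentials. The `-T` flag disables TTY allocation so the redirected file reaches `mysql` on stdin; the password is given inline because an interactive prompt cannot work once stdin carries the SQL file.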
Happy hacking!