Before diving into the architecture, here is a plain-language explanation of every tool and service used in this plan — what it is, what problem it solves, and what role it plays specifically in the NetShield pipeline.
In the target state, parsed rows land in the cloud database's raw_data schema, dbt runs SQL models over them to produce the final social_data_alfa tables, and the DS team reads those tables directly — no more "the DB is on my laptop." The daily HTML report is published at trendact.github.io/-elt/, so anyone with the link can view it without a Flask server or Cloudflare tunnel. Today, by contrast, the pipeline is split across three machines with no shared infrastructure, and every daily run requires manual coordination between three people.
| Pipeline | Owner | What it does | Output | Problem |
|---|---|---|---|---|
| Data Collection | Machine A | Scrapes TikTok (by keyword + WhatsApp links), produces NDJSON files | NDJSON files on local disk | Manual: files must be physically sent / committed |
| ELT Pipeline (this repo) | Machine B (yours) | Parses NDJSON → loads to local PostgreSQL → dbt → HTML report | Tables in social_data_alfa schema, HTML report | Manual: DB is local, nobody else can read it |
| DS Pipeline | Machine C | Reads processed tables, trains / runs models | Model outputs, analysis | Blocked: cannot access local DB on Machine B. Repo: Trendact/-ai-pipeline |
The goal is a single automated daily pipeline where:
git push → everything else runs automatically → DS team queries cloud DB with fresh data by morning.
| Component | Service | Free Limit | Cost | Status |
|---|---|---|---|---|
| Orchestration (Airflow replacement) | GitHub Actions | 2,000 min/month (private repo) — our daily run ≈ 10–23 min (~10 min with bulk COPY) → ~300–690 min/month | FREE | Well within free tier |
| Cloud PostgreSQL | Neon Pro | 10 GB storage, unlimited connections, always-on option | $19/mo | Free tier is not viable at 100k rows/load (~60 MB/day → free tier exhausted in 8 days). Neon Pro (10 GB) required from day one. 10 GB lasts ~5.5 months before needing archival. See Section 6 for full breakdown. |
| Report hosting | GitHub Pages | 1 GB storage, 100 GB bandwidth/month | FREE | HTML report committed and served as a static page |
| Source code | GitHub Repos | Unlimited repositories | FREE | ELT pipeline code. NDJSON data files must NOT be committed to git at this scale — see Section 7. |
| NDJSON data file transfer | GitHub Releases (artifacts) | 2 GB per file, unlimited releases on public repos | FREE | 100k rows ≈ 50–100 MB per NDJSON file. GitHub Releases handle this cleanly. Data team uploads a release asset, Actions workflow downloads it. See Section 7. |
| Secrets management | GitHub Actions Secrets | Unlimited encrypted secrets per repo/org | FREE | DB connection string, passwords stored as secrets |
| dbt | dbt-core (self-hosted) | Open source, runs inside Actions runner | FREE | Already in this repo |
| Python environment | GitHub Actions runner (ubuntu-latest) | Python 3.x pre-installed on every runner | FREE | pip install from requirements.txt |
| Actual Apache Airflow | Astro (managed Airflow) | Free tier: 1 deployment, limited tasks | ~$0 but restricted | Optional upgrade if DAG complexity grows |
GitHub Actions runs a YAML workflow file on a schedule. Each job = one Airflow task. Jobs can depend on each other, pass data between steps, and send notifications on failure — everything Airflow does for this use case.
# .github/workflows/daily_pipeline.yml
name: NetShield Daily Pipeline

on:
  schedule:
    - cron: '0 6 * * *'    # runs every day at 06:00 UTC
  workflow_dispatch:       # also allows manual trigger from GitHub UI
  repository_dispatch:     # triggered by data collection team's push
    types: [data-ready]

jobs:
  load-and-transform:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout ELT repo
        uses: actions/checkout@v4

      - name: Checkout data collection repo
        uses: actions/checkout@v4
        with:
          repository: Trendact/data-collection   # data team's repo
          path: data_intake

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.13'
          cache: pip

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run pipeline (load + dbt + report)
        env:
          DB_HOST: ${{ secrets.NEON_HOST }}
          DB_USER: ${{ secrets.NEON_USER }}
          DB_PASSWORD: ${{ secrets.NEON_PASSWORD }}
          DB_NAME: ${{ secrets.NEON_DB }}
        run: python run_pipeline.py

      - name: Publish report to GitHub Pages
        uses: peaceiris/actions-gh-pages@v4
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./reports
| Step | Estimated Time |
|---|---|
| Checkout + install deps | ~2 min |
| Download NDJSON from GitHub Release (~100 MB) | ~1 min |
| Load 100k rows to PostgreSQL | ~10–15 min (bulk COPY: ~5 min) |
| dbt run (more rows = more model compute) | ~2–3 min |
| Generate report | ~1 min |
| Publish to Pages | <1 min |
| Total per run | ~17–23 min (optimised: ~10 min with bulk COPY) |
| Total per month (30 runs) | ~510–690 min (of 2,000 free) ✓ |
Use COPY (bulk insert) instead of row-by-row INSERT for the loading step. This can cut load time from ~15 min to ~3–5 min for 100k rows.
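As a sketch of how that loading step might look with psycopg2's `copy_expert` (function names and field handling are illustrative, not the repo's actual loader code; `conn` is assumed to be a psycopg2 connection):

```python
import io
import json

def ndjson_to_copy_buffer(path: str, fields: list[str]) -> io.StringIO:
    """Flatten an NDJSON file into a tab-separated buffer for COPY.
    NB: real data should escape tabs/newlines/backslashes in values first."""
    buf = io.StringIO()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            row = json.loads(line)
            # \N is the NULL marker in PostgreSQL's text COPY format
            values = ["\\N" if row.get(f) is None else str(row[f]) for f in fields]
            buf.write("\t".join(values) + "\n")
    buf.seek(0)
    return buf

def bulk_load(conn, table: str, fields: list[str], path: str) -> None:
    """Stream the whole file through one COPY instead of 100k INSERTs."""
    buf = ndjson_to_copy_buffer(path, fields)
    with conn.cursor() as cur:
        cur.copy_expert(
            f"COPY {table} ({', '.join(fields)}) FROM STDIN WITH (FORMAT text)",
            buf,
        )
    conn.commit()
```

A single COPY round-trip is where the speedup comes from: the server parses one bulk stream rather than executing one statement per row.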
Neon is a serverless PostgreSQL provider, fully compatible with psycopg2, dbt-postgres, and SQLAlchemy, with no credit card required to sign up for the free tier. At the target volume, however, this plan needs Neon Pro (see Section 6).
The local PostgreSQL database was measured before recommending a storage tier. Here are the real numbers:
| Schema | Table | Size (total incl. indexes) | Notes |
|---|---|---|---|
| raw_data | tiktok_posts_meta | 2.3 MB | Largest table — 2,967 rows |
| raw_data | tiktok_script_out | 968 kB | |
| social_data_alfa | hashtags | 1.1 MB | |
| social_data_alfa | posts | 1.0 MB | 2,898 rows |
| social_data_alfa | post_metadata | 960 kB | |
| — | All other tables combined | ~1.5 MB | |
| — | Total user data (raw_data + social_data_alfa) | ~8 MB | Full DB including system tables: 18 MB |
Average storage per row across all tables: ~0.6 KB. Target load: 100,000 rows per load = ~60 MB per daily run.
| Rows per load | Daily DB growth | Free tier (0.5 GB) lasts | Neon Pro (10 GB) lasts | Verdict |
|---|---|---|---|---|
| 500 rows | ~0.3 MB/day | ~4.5 years | N/A | ✓ Free tier OK |
| 10,000 rows | ~6 MB/day | ~3 months | ~4.5 years | ⚠ Upgrade soon |
| 100,000 rows ← TARGET | ~60 MB/day | ✗ ~8 days | ~5.5 months | Neon Pro required from day 1 |
| 100,000 rows + archiving old data | Net ~10–20 MB/day | N/A | 1–2 years | ✓ With archival strategy |
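The figures in the table follow directly from the measured ~0.6 KB/row; a quick arithmetic check (decimal GB, matching provider marketing sizes):

```python
MB_PER_GB = 1_000   # decimal GB, as storage tiers are advertised
ROW_MB = 0.0006     # measured average: ~0.6 KB per row

def days_until_full(capacity_gb: float, rows_per_day: int) -> float:
    """How long a storage tier lasts at a given daily load."""
    return capacity_gb * MB_PER_GB / (rows_per_day * ROW_MB)

# Free tier (0.5 GB) at the 100k-row target: ~8.3 days
# Neon Pro (10 GB) at the same load: ~167 days, i.e. ~5.5 months
```

At 500 rows/day the same formula gives ~1,667 days (~4.5 years) on the free tier, matching the table above.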
Archiving old raw_data and keeping only the social_data_alfa tables live extends Neon Pro to 1–2 years of operation.
Rather than storing every raw NDJSON row forever, the pipeline should periodically archive old raw_data tables to compressed files in GitHub Releases and truncate the source tables. The social_data_alfa transformed tables are much smaller and can be kept live indefinitely.
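A sketch of what that archival job could look like (names are hypothetical; `conn` is assumed to be a psycopg2 connection, and the produced file would then be uploaded with `gh release upload`):

```python
import gzip
import json
from datetime import date, timedelta

def archive_filename(table: str, cutoff: date) -> str:
    """Release-asset name for rows older than the cutoff date."""
    return f"{table}.before-{cutoff.isoformat()}.ndjson.gz"

def archive_and_truncate(conn, table: str, ts_column: str, keep_days: int = 30) -> str:
    """Dump rows older than keep_days to gzipped NDJSON, then delete them.
    Assumes psycopg2, which returns row_to_json results as Python dicts."""
    cutoff = date.today() - timedelta(days=keep_days)
    path = archive_filename(table, cutoff)
    with conn.cursor() as cur, gzip.open(path, "wt", encoding="utf-8") as out:
        cur.execute(
            f"SELECT row_to_json(t) FROM {table} t WHERE {ts_column} < %s",
            (cutoff,),
        )
        for (row,) in cur:
            out.write(json.dumps(row) + "\n")
        # only delete after the dump has been written successfully
        cur.execute(f"DELETE FROM {table} WHERE {ts_column} < %s", (cutoff,))
    conn.commit()
    return path
```

Running this weekly from the same Actions workflow keeps raw_data growth near zero while the compressed history stays retrievable from Releases.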
| Provider | Free storage | Free connections | Expiry | Credit card? |
|---|---|---|---|---|
| Neon Pro ✓ recommended | 10 GB | Unlimited | Never | No |
| Neon Free | 0.5 GB | Unlimited | Never | No (but exhausted in 8 days at 100k rows) |
| Supabase | 500 MB | Unlimited | Never (with activity) | No |
| Railway | 1 GB | Unlimited | 500 hrs/month (sleeps) | No |
| Render | 1 GB | 97 connections | 90 days then deleted | No |
After creating the Neon project, replace your local connection string:
# .env (local machine — never committed)
DB_HOST=ep-xxxxx.us-east-2.aws.neon.tech
DB_PORT=5432
DB_NAME=neondb
DB_USER=netshield_user
DB_PASSWORD=xxxxxxxxxxxxxxxx
DB_SSLMODE=require # required for Neon
# GitHub Actions secrets (set in repo Settings → Secrets)
# NEON_HOST, NEON_USER, NEON_PASSWORD, NEON_DB
The db_config.py and profiles.yml files already read from environment variables — only the values change, not the code logic.
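For reference, the pattern looks roughly like this (a sketch of the idea, not the actual db_config.py contents):

```python
import os

def connection_params() -> dict:
    """Build psycopg2/SQLAlchemy connection kwargs from the environment.
    Locally the values come from .env; in Actions, from the NEON_* secrets
    mapped to DB_* variables in the workflow's env block."""
    return {
        "host": os.environ["DB_HOST"],
        "port": int(os.environ.get("DB_PORT", "5432")),
        "dbname": os.environ["DB_NAME"],
        "user": os.environ["DB_USER"],
        "password": os.environ["DB_PASSWORD"],
        "sslmode": os.environ.get("DB_SSLMODE", "require"),  # Neon requires SSL
    }
```

Because only environment values change, the same code path works unchanged against the local database and against Neon.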
The weakest link in the current flow is how NDJSON files get from the data collection machine to the ELT pipeline. There are two options:
Option A (recommended): the data collection team uploads their NDJSON file as a GitHub Release asset to the Trendact/data-intake repo (up to 2 GB per file, free on public repos). Publishing the release triggers a release event, which fires the ELT workflow automatically — no polling, no waiting for cron.
# In data-intake repo — .github/workflows/notify.yml
on:
  release:
    types: [published]   # fires when data team publishes a new release

jobs:
  notify-elt:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger ELT pipeline with release URL
        run: |
          curl -X POST \
            -H "Authorization: token ${{ secrets.PAT_TOKEN }}" \
            -H "Accept: application/vnd.github.v3+json" \
            https://api.github.com/repos/Trendact/-elt/dispatches \
            -d '{"event_type":"data-ready","client_payload":{"asset_url":"${{ github.event.release.assets[0].browser_download_url }}"}}'
The ELT workflow receives the direct download URL of the NDJSON file as a payload, downloads it with curl inside the Actions runner, and processes it. No large files ever enter a git commit.
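Inside the runner, a small helper could pull the URL out of the event payload instead of curl (a sketch; GITHUB_EVENT_PATH is the standard Actions variable pointing at the event JSON, while the function names and destination path are assumptions):

```python
import json
import os
import urllib.request

def asset_url_from_event(event_path: str) -> str:
    """Read the NDJSON asset URL from the repository_dispatch payload."""
    with open(event_path, encoding="utf-8") as fh:
        event = json.load(fh)
    return event["client_payload"]["asset_url"]

def download_ndjson(dest: str = "data_intake/latest.ndjson") -> str:
    """Fetch the release asset into the runner's workspace."""
    url = asset_url_from_event(os.environ["GITHUB_EVENT_PATH"])
    urllib.request.urlretrieve(url, dest)  # public repo: no auth token needed
    return dest
```

Either approach works; the point is that the file is fetched at run time and never enters version control.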
Option B (monorepo): all teams commit to the same repo under separate folders (data_intake/, elt/). Simpler for a small team, but it mixes concerns and the repo grows without bound as NDJSON files accumulate.
After migration, the DS team updates their connection string in the Trendact/-ai-pipeline repo to point to Neon. No other changes are required on their side:
# DS pipeline — Python
import psycopg2

conn = psycopg2.connect(
    host="ep-xxxxx.us-east-2.aws.neon.tech",
    dbname="neondb",
    user="netshield_user",
    password="xxxxxx",
    sslmode="require",
)
No VPN, no IP whitelisting, no file transfers needed. Neon supports connections from any IP by default.
Go to neon.tech → sign up free → create project → copy connection string. Create two roles: one for the pipeline (read + write), one for the DS team (read-only).
CREATE ROLE ds_reader LOGIN PASSWORD 'xxx';
GRANT USAGE ON SCHEMA social_data_alfa TO ds_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA social_data_alfa TO ds_reader;
ALTER DEFAULT PRIVILEGES IN SCHEMA social_data_alfa GRANT SELECT ON TABLES TO ds_reader; -- run as the pipeline role so grants survive dbt's drop-and-recreate
Dump the local PostgreSQL schema + data and restore to Neon (--no-owner avoids role-mismatch errors on restore):
pg_dump -h localhost -U postgres --no-owner postgres | psql "postgres://netshield_user:xxx@ep-xxx.neon.tech/neondb?sslmode=require"
Update your local .env and ~/.dbt/profiles.yml with the Neon connection details. Run run_pipeline.py locally once to verify dbt + load all work against the cloud DB.
In the GitHub repo → Settings → Secrets and variables → Actions → add: NEON_HOST, NEON_USER, NEON_PASSWORD, NEON_DB. These replace the .env file inside the Actions runner.
The Actions runner has no .venv. Export your environment:
.venv\Scripts\pip freeze > requirements.txt then commit it. The workflow will install from this file.
Create .github/workflows/daily_pipeline.yml using the template in Section 5. Commit and push. The Actions tab in GitHub will show the workflow — trigger it manually once to verify end-to-end in the cloud.
In repo Settings → Pages → Source: gh-pages branch. After the first successful workflow run, the HTML report is published at https://trendact.github.io/-elt/ — accessible to anyone with the URL.
Create Trendact/data-intake repo. Data collection team clones it and adds the notify workflow (Section 7). When they finish a scrape, they create a GitHub Release and upload the NDJSON file as a release asset (via gh release create CLI or GitHub UI). This fires the ELT pipeline automatically — no files ever committed to git.
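The data team's publish step could be wrapped in a few lines of Python around the gh CLI (a sketch; the tag convention and helper names are assumptions, and `gh` must be installed and authenticated):

```python
import subprocess
from datetime import date

def release_tag(day: date) -> str:
    """Tag naming convention for daily scrape releases, e.g. scrape-2025-01-01."""
    return f"scrape-{day.isoformat()}"

def publish_scrape(ndjson_path: str, repo: str = "Trendact/data-intake") -> None:
    """Create a dated release and attach the NDJSON file as an asset.
    Publishing it fires the notify workflow, which dispatches the ELT run."""
    tag = release_tag(date.today())
    subprocess.run(
        ["gh", "release", "create", tag, ndjson_path,
         "--repo", repo, "--title", tag, "--notes", "Daily scrape drop"],
        check=True,  # raise if gh fails so the scrape script notices
    )
```

One dated tag per day also gives the team a free, browsable history of every raw file delivered.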
Give the DS team (Trendact/-ai-pipeline) the read-only role credentials. They update their connection string in their own repo. Verify they can SELECT from social_data_alfa.
Let the cron run automatically. Check the Actions log for errors. Common issues: missing Python package in requirements.txt, dbt profile path differences between Windows local and Linux runner. Patch and commit.
| Item | Service | Monthly Cost | Notes |
|---|---|---|---|
| Orchestration (pipeline scheduling + running) | GitHub Actions | $0 | ~300–690 of 2,000 free minutes used |
| Cloud PostgreSQL | Neon Pro | $19 | 10 GB storage, always-on; free tier would be exhausted in ~8 days at the 100k-row target |
| Report hosting | GitHub Pages | $0 | Static HTML, no Flask server needed |
| Code + data file storage | GitHub Repos + Releases | $0 | Code in the repo; NDJSON files as Release assets (Section 7), never committed to git |
| Secrets management | GitHub Actions Secrets | $0 | Encrypted at rest, injected at runtime |
| Domain / SSL for report | GitHub Pages (included) | $0 | HTTPS enforced by default on github.io URLs |
| TOTAL | $19 / month | Only Neon Pro requires payment. Everything else remains free. |
| Bottleneck | Trigger | Upgrade | Cost |
|---|---|---|---|
| DB storage | >10 GB on Neon Pro (after ~5.5 months with no archiving) | Add archival job to pipeline, or Neon Business plan | $0 (archival) or $69/mo (50 GB) |
| Pipeline complexity grows | Need real DAG retries, branching, visual monitoring | Astro (managed Airflow) | $0–$30/mo |
| Actions minutes | >2,000 min/month (loading 100k rows may take 15–20 min/run) | Self-hosted runner on a cheap VM | $4–$6/mo |
| Private report URL | Need password protection | Re-enable Flask + Cloudflare (already built) | $0 |
| Limitation | Impact | Severity | Solution when ready |
|---|---|---|---|
| GitHub Actions is not real Airflow | No visual DAG UI, no automatic retry with backoff, no SLA alerting | LOW for now | Migrate to Astro or self-hosted Airflow when DAG complexity grows |
| Neon auto-suspends after 5 min idle | Only on free tier — not an issue on Neon Pro (always-on) | N/A on Pro | Neon Pro is always-on. No cold start delay. |
| Report on GitHub Pages is public | Anyone with the URL can read the report | MEDIUM | Keep Flask + Cloudflare for sensitive sharing; or use a private GitHub Pages (requires GitHub Team plan) |
| NDJSON files in git repo | 100k rows ≈ 50–100 MB per file — hits GitHub's 100 MB hard limit, breaks git push | BLOCKER — do not commit data files | Use GitHub Releases as the data drop zone (Section 7). ELT workflow downloads the asset URL directly. No data files ever enter a git commit. |
| dbt profile uses Linux path in Actions | profiles.yml path differs between Windows and Linux runner | ACTION NEEDED | Pass profiles dir via --profiles-dir . flag in workflow, keep a profiles.yml in the ELT folder for CI |