LimaCharlie Documentation Pipeline

This directory contains scripts to fetch and process the LimaCharlie documentation from https://docs.limacharlie.io/docs.

Scripts

fetch_docs.py

Main script to fetch all documentation articles from the LimaCharlie documentation site via the Algolia API.

Features:

- Automatically extracts API credentials from the documentation page
- Fetches all public, non-deleted, non-draft articles (~612 articles)
- Creates directory structure based on article breadcrumbs
- Saves articles as markdown with metadata headers
- Supports resume capability (skips already downloaded files)
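As a rough illustration of the Algolia fetch, the sketch below builds the pieces of a single search request. The helper name `build_algolia_query` is hypothetical and not taken from `fetch_docs.py`. The endpoint and header names follow Algolia's public REST search API, and the empty query with `hitsPerPage` set to 1000 is an assumption about how all records could be retrieved in one call.

```python
from urllib.parse import quote


def build_algolia_query(app_id: str, api_key: str, index_name: str) -> dict:
    """Return url/headers/json for one Algolia search request (hypothetical helper)."""
    return {
        # Algolia's REST search endpoint for a single index.
        "url": f"https://{app_id}-dsn.algolia.net/1/indexes/{quote(index_name, safe='')}/query",
        "headers": {
            "X-Algolia-Application-Id": app_id,
            "X-Algolia-API-Key": api_key,
        },
        # Assumption: an empty query with a large page size returns every record at once.
        "json": {"query": "", "hitsPerPage": 1000},
    }
```

The returned dictionary can be passed straight to `requests.post(**build_algolia_query(app_id, api_key, index_name), timeout=30)`, with the three values coming from the credential extraction step.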

Dependencies:

pip3 install requests beautifulsoup4 html2text

Usage:

# Fetch all documentation
python3 limacharlie/pipeline/fetch_docs.py

# Or make it executable and run directly
chmod +x limacharlie/pipeline/fetch_docs.py
./limacharlie/pipeline/fetch_docs.py

Output:

- Articles are saved to ./limacharlie/raw_markdown/
- Directory structure mirrors the breadcrumb hierarchy (e.g., Add-Ons/API Integrations/)
- Each file includes a YAML metadata header with title, slug, breadcrumb, source URL, and article ID
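To make the breadcrumb-to-directory mapping concrete, here is a minimal sketch. The helper `breadcrumb_to_path` is hypothetical. The ` > ` separator is taken from the breadcrumb shown in the output format, while the `<slug>.md` file name and the replacement of `/` inside a segment are assumptions, not confirmed behavior of `fetch_docs.py`.

```python
from pathlib import Path

OUTPUT_ROOT = Path("limacharlie/raw_markdown")  # output directory named in this README


def breadcrumb_to_path(breadcrumb: str, slug: str, root: Path = OUTPUT_ROOT) -> Path:
    """Map a breadcrumb such as 'Add-Ons > API Integrations' to a file path.

    Hypothetical helper: the '<slug>.md' file name is an assumption.
    """
    # Split on '>' and drop empty segments and surrounding whitespace.
    parts = [p.strip() for p in breadcrumb.split(">") if p.strip()]
    # Replace path separators inside a segment so it stays a single directory.
    safe_parts = [p.replace("/", "-") for p in parts]
    return root.joinpath(*safe_parts, f"{slug}.md")
```

For example, `breadcrumb_to_path("Add-Ons > API Integrations", "my-article")` yields `limacharlie/raw_markdown/Add-Ons/API Integrations/my-article.md`; calling `path.parent.mkdir(parents=True, exist_ok=True)` before writing would create the directories.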

test_fetch.py

Test script that fetches only the first 3 articles to verify the setup works correctly.

Usage:

python3 limacharlie/pipeline/test_fetch.py

How It Works

1. API Credential Extraction: The script fetches the documentation home page and extracts the Algolia API credentials (app ID, search key, index name) from the page source.
2. Article Metadata Fetch: Using the Algolia API, it fetches all article metadata including title, slug, breadcrumb, and full text content in a single query.
3. Filtering: The Algolia API key has built-in filters that automatically exclude:
   - Deleted articles (isDeleted: true)
   - Hidden articles (isHidden: true)
   - Draft articles (isDraft: true)
   - Excluded articles (exclude: true)
   - Category entries (isCategory: true)
   - Unpublished articles (isUnpublished: true)
4. Processing: For each article:
   - Creates the directory structure based on breadcrumb
   - Generates a markdown file with metadata header
   - Saves the plain text content from Algolia
5. Error Handling:
   - Skips articles without content
   - Supports resume capability (skips existing files)
   - Comprehensive error logging
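The filtering and skip rules above can be sketched as two small predicates. According to this README the exclusion is enforced by the Algolia API key itself, so the client-side check here is only an illustration of the same rules. The function names and the `content` field name are assumptions.

```python
from pathlib import Path

# Flags listed above; a record with any of them set to True is excluded.
EXCLUSION_FLAGS = ("isDeleted", "isHidden", "isDraft", "exclude", "isCategory", "isUnpublished")


def is_public_article(record: dict) -> bool:
    """Return True if none of the exclusion flags is set on the record."""
    return not any(record.get(flag) for flag in EXCLUSION_FLAGS)


def should_write(record: dict, target: Path) -> bool:
    """Apply the skip rules: no content, or file already downloaded."""
    if not record.get("content"):  # 'content' field name is an assumption
        return False
    if target.exists():  # resume: keep the existing file untouched
        return False
    return is_public_article(record)
```

Checking for the existing file before writing is what makes an interrupted run safe to restart: completed articles are left alone and only the missing ones are written.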

Output Format

Each markdown file contains:

---
title: Article Title
slug: article-slug
breadcrumb: Category > Subcategory
source: https://docs.limacharlie.io/docs/article-slug
articleId: uuid-here
---

Article content in plain text format...
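A minimal sketch of how such a file body could be assembled from one record follows. The function name `render_article` and the record keys are assumptions chosen to mirror the header fields, and the source URL pattern is copied from the example above.

```python
def render_article(record: dict) -> str:
    """Build the file text: metadata header, blank line, then the content."""
    header = [
        "---",
        f"title: {record['title']}",
        f"slug: {record['slug']}",
        f"breadcrumb: {record['breadcrumb']}",
        # URL pattern taken from the example header in this README.
        f"source: https://docs.limacharlie.io/docs/{record['slug']}",
        f"articleId: {record['articleId']}",
        "---",
    ]
    return "\n".join(header) + "\n\n" + record["content"] + "\n"
```

This sketch writes values verbatim, so a title containing a colon would not parse as strict YAML; quoting the values would be one way to handle that if the header needs to be machine-read.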

Statistics

As of the last run:

- Total entries in Algolia: 680
- Articles (filtered): 612
- Categories (excluded): 68
- Output directory: ./limacharlie/raw_markdown/