Source Scrapers
Horizon fetches content from five source types. All scrapers inherit from BaseScraper, share an async HTTP client, and implement a fetch(since) method that returns a list of ContentItem objects. Sources are fetched concurrently via asyncio.gather.
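The shared pattern can be sketched as follows. BaseScraper, ContentItem, fetch(since), and the asyncio.gather fan-out come from the description above; the ContentItem fields and the EchoScraper/fetch_all names are illustrative stand-ins, not the real implementation.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ContentItem:
    # Field names are illustrative; the real ContentItem carries more metadata.
    title: str
    url: str
    source: str


class BaseScraper(ABC):
    @abstractmethod
    async def fetch(self, since: datetime) -> list[ContentItem]:
        """Return items published after `since`."""


class EchoScraper(BaseScraper):
    """Stand-in scraper used only to demonstrate the concurrent fan-out."""

    def __init__(self, name: str) -> None:
        self.name = name

    async def fetch(self, since: datetime) -> list[ContentItem]:
        return [ContentItem(title=f"{self.name} item",
                            url="https://example.com", source=self.name)]


async def fetch_all(scrapers: list[BaseScraper], since: datetime) -> list[ContentItem]:
    # All sources are fetched concurrently, mirroring the asyncio.gather call.
    batches = await asyncio.gather(*(s.fetch(since) for s in scrapers))
    return [item for batch in batches for item in batch]
```

Because gather preserves argument order, results come back grouped per source even though the requests overlap in time.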
Hacker News
File: src/scrapers/hackernews.py
Uses the Firebase HN API:
- GET /topstories.json – fetches top story IDs
- GET /item/{id}.json – fetches story/comment details
Stories and their comments are fetched concurrently. For each story, the top 5 comments are included (deleted/dead comments excluded, HTML stripped, truncated at 500 chars).
Config (sources.hackernews):
{
"enabled": true,
"fetch_top_stories": 30,
"min_score": 100
}
- fetch_top_stories – number of top story IDs to fetch
- min_score – minimum HN points to include a story
Extracted data: title, URL (falls back to HN discussion URL), author, score, comment count, and top comment text.
GitHub
File: src/scrapers/github.py
Uses the GitHub REST API:
- GET /users/{username}/events/public – user activity events
- GET /repos/{owner}/{repo}/releases – repository releases
Two source types are supported:
- user_events – tracks push, create, release, public, and watch events for a user
- repo_releases – tracks new releases for a specific repository
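For user_events, the filtering step can be sketched like this. The event type names are GitHub's actual names for the tracked events; the `select_events` helper and its exact logic are assumptions, not the scraper's real code.

```python
from datetime import datetime, timezone

# GitHub's names for the event types the user_events source tracks.
TRACKED_EVENTS = {"PushEvent", "CreateEvent", "ReleaseEvent",
                  "PublicEvent", "WatchEvent"}


def select_events(events: list[dict], since: datetime) -> list[dict]:
    """Keep only tracked event types newer than `since` (hypothetical helper)."""
    kept = []
    for ev in events:
        if ev.get("type") not in TRACKED_EVENTS:
            continue
        # created_at uses the API's ISO-8601 UTC format, e.g. 2024-06-01T12:00:00Z
        created = datetime.strptime(
            ev["created_at"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if created > since:
            kept.append(ev)
    return kept
```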
Config (sources.github, list of entries):
{
"type": "user_events",
"username": "torvalds",
"enabled": true
}
{
"type": "repo_releases",
"owner": "golang",
"repo": "go",
"enabled": true
}
Authentication: Set GITHUB_TOKEN in your environment for higher rate limits (5000 req/hr vs 60 without).
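A minimal sketch of the optional-token header logic, assuming the standard GitHub REST conventions; the scraper's actual header construction may differ.

```python
import os


def github_headers() -> dict[str, str]:
    """Add a bearer token only when GITHUB_TOKEN is set in the environment."""
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```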
RSS
File: src/scrapers/rss.py
Fetches any Atom/RSS feed using the feedparser library. Tries multiple date fields (published, updated, created) with fallback parsing.
Config (sources.rss, list of entries):
{
"name": "Simon Willison",
"url": "https://simonwillison.net/atom/everything/",
"enabled": true,
"category": "ai-tools"
}
- category – optional tag for grouping (e.g., "programming", "microblog")
Extracted data: title, URL, author, content (from summary/description/content fields), feed name, category, and entry tags.
Reddit
File: src/scrapers/reddit.py
Uses Reddit's public JSON API (www.reddit.com):
- GET /r/{subreddit}/{sort}.json – subreddit posts
- GET /user/{username}/submitted.json – user submissions
- GET /r/{subreddit}/comments/{post_id}.json – post comments
Subreddits and users are fetched concurrently. Comments are sorted by score, limited to the configured count, and exclude moderator-distinguished comments. Self-text is truncated at 1500 chars, comments at 500 chars.
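The comment-selection rules above can be sketched as a pure function. `distinguished`, `score`, and `body` are the real fields in Reddit's JSON; the `select_comments` helper is hypothetical.

```python
def select_comments(comments: list[dict], limit: int = 5,
                    max_len: int = 500) -> list[str]:
    """Drop moderator-distinguished comments, sort by score, truncate bodies."""
    eligible = [c for c in comments if c.get("distinguished") != "moderator"]
    eligible.sort(key=lambda c: c.get("score", 0), reverse=True)
    return [c.get("body", "")[:max_len] for c in eligible[:limit]]
```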
Config (sources.reddit):
{
"enabled": true,
"fetch_comments": 5,
"subreddits": [
{
"subreddit": "MachineLearning",
"sort": "hot",
"fetch_limit": 25,
"min_score": 10
}
],
"users": [
{
"username": "spez",
"sort": "new",
"fetch_limit": 10
}
]
}
- sort – hot, new, top, or rising (subreddits); hot or new (users)
- time_filter – for top/rising sorts: hour, day, week, month, year, all
- min_score – minimum post score (subreddits only)
Rate limiting: Detects HTTP 429 responses, reads the Retry-After header, waits, and retries once. Uses a descriptive User-Agent as required by Reddit's API guidelines.
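The retry-once logic can be sketched independently of any particular HTTP library. Here `fetch` stands in for the shared async client and is assumed to return a `(status, headers, body)` tuple; that abstraction and the function name are illustrative.

```python
import asyncio


async def get_with_retry(fetch, url: str):
    """On HTTP 429, honor Retry-After (default 1s) and retry exactly once."""
    status, headers, body = await fetch(url)
    if status == 429:
        delay = float(headers.get("Retry-After", 1))
        await asyncio.sleep(delay)
        status, headers, body = await fetch(url)
    return status, body
```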
Extracted data: title, URL, author, score, upvote ratio, comment count, subreddit, flair, self-text, and top comments.
Twitter
File: src/scrapers/twitter.py
Uses the Apify platform to bypass Twitter's anti-scraping measures. The actor altimis/scweet is called via the Apify REST API.
Flow:
- POST to /v2/acts/{actor_id}/runs to trigger a run
- Poll /v2/actor-runs/{run_id} until status is SUCCEEDED or a terminal failure
- GET /v2/datasets/{dataset_id}/items to retrieve results
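The trigger/poll/collect flow can be sketched as below. The endpoint paths come from the steps above; `client` stands in for the shared async HTTP client (its `post`/`get` are assumed to return parsed JSON), and the exact set of terminal statuses is an assumption about Apify's run states.

```python
import asyncio

APIFY_BASE = "https://api.apify.com"
# Assumed terminal run statuses; SUCCEEDED is the only success state.
TERMINAL = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


async def run_actor(client, actor_id: str, run_input: dict,
                    poll_interval: float = 5.0) -> list[dict]:
    """Trigger an actor run, poll until terminal, then fetch dataset items."""
    run = await client.post(f"{APIFY_BASE}/v2/acts/{actor_id}/runs", json=run_input)
    run_id = run["data"]["id"]
    while True:
        detail = (await client.get(f"{APIFY_BASE}/v2/actor-runs/{run_id}"))["data"]
        if detail["status"] in TERMINAL:
            break
        await asyncio.sleep(poll_interval)
    if detail["status"] != "SUCCEEDED":
        raise RuntimeError(f"Apify run {run_id} ended with {detail['status']}")
    dataset_id = detail["defaultDatasetId"]
    return await client.get(f"{APIFY_BASE}/v2/datasets/{dataset_id}/items")
```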
Config (sources.twitter):
{
"enabled": true,
"users": ["karpathy", "ylecun"],
"fetch_limit": 10,
"fetch_reply_text": false,
"max_replies_per_tweet": 3,
"max_tweets_to_expand": 10,
"reply_min_likes": 5,
"actor_id": "altimis~scweet",
"apify_token_env": "APIFY_TOKEN"
}
- users – Twitter screen names to monitor, without the @ prefix
- fetch_limit – maximum tweets to fetch per run
- fetch_reply_text – when true, a second Apify run fetches reply bodies for each important tweet and appends them under --- Top Comments --- for AI analysis
- max_replies_per_tweet – maximum reply lines per tweet (sorted by engagement score)
- max_tweets_to_expand – cap on reply expansion runs per pipeline cycle, to control Apify credit usage
- reply_min_likes – minimum likes required for a reply to be included
- actor_id – Apify actor ID (default: altimis~scweet)
- apify_token_env – environment variable name containing the Apify API token
Authentication: Set APIFY_TOKEN in your .env. Get a token at console.apify.com.
Extracted data: tweet text, URL, author, publish time, likes, retweets, replies, views, and (optionally) reply-thread text appended under --- Top Comments ---.