Crawl4AI: CSS Scraping and Content Filtering Demystified
Complete Real-World Example Shows Crawl4AI CSS Extraction and Filtering
Crawl4AI has been moving from isolated snippets to end-to-end pipelines that actually scrape, clean and structure data. The latest code drop stitches together three of its core capabilities: CSS-based element selection, a simple content filter and a JSON schema that maps selectors to named fields in the extracted output. What makes the example worth a second look is its choice of target: Hacker News, a site whose front page changes every few minutes and whose markup mixes headlines, timestamps and discussion links.
By wiring a markdown generator and a CSS extraction strategy into the same crawler run, the script shows how tidy, machine-readable output can be produced without manual post-processing. Readers familiar with earlier tutorials will notice the shift from “just fetch HTML” to a more disciplined extraction that respects both visual structure (via CSS selectors) and semantic intent (via the schema). The following snippet marks the start of that demonstration, complete with console banners and a brief description of what the function does.
```python
# Imports assume a recent Crawl4AI release; module paths can shift between versions.
import asyncio
import json

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

print("\n" + "=" * 60)
print("🌟 PART 16: COMPLETE REAL-WORLD EXAMPLE")
print("=" * 60)


async def complete_example():
    """Complete example combining CSS extraction with content filtering."""
    print("\n🌟 Running complete example: Hacker News scraper with filtering")

    # Schema: one record per story row (tr.athing), with CSS-selected fields.
    schema = {
        "name": "HN Stories",
        "baseSelector": "tr.athing",
        "fields": [
            {"name": "rank", "selector": "span.rank", "type": "text"},
            {"name": "title", "selector": "span.titleline > a", "type": "text"},
            {"name": "url", "selector": "span.titleline > a",
             "type": "attribute", "attribute": "href"},
            {"name": "site", "selector": "span.sitestr", "type": "text"},
        ],
    }

    browser_config = BrowserConfig(
        headless=True,
        viewport_width=1920,
        viewport_height=1080,
    )

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.4)
        ),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=run_config,
        )

        if result.extracted_content:
            stories = json.loads(result.extracted_content)
            print(f"\n✅ Successfully extracted {len(stories)} stories!")
            print(f"\n{'=' * 70}")
            print("📰 TOP HACKER NEWS STORIES")
            print("=" * 70)
            for story in stories[:15]:
                rank = story.get('rank', '?').strip('.') if story.get('rank') else '?'
                title = story.get('title', 'No title')[:50]
                site = story.get('site', 'N/A')
                print(f"  #{rank:<3} {title:<50} ({site})")
            print("=" * 70)
            return stories

    return []


stories = asyncio.run(complete_example())

print("\n" + "=" * 60)
print("💾 BONUS: SAVING RESULTS")
print("=" * 60)

if stories:
    with open('hacker_news_stories.json', 'w') as f:
        json.dump(stories, f, indent=2)
    print(f"✅ Saved {len(stories)} stories to 'hacker_news_stories.json'")
    print("\nTo download in Colab:")
    print("    from google.colab import files")
    print("    files.download('hacker_news_stories.json')")

print("\n" + "=" * 60)
print("📚 TUTORIAL COMPLETE!")
print("=" * 60)
print("""
✅ What you learned:
  1. Complete real-world scraping example

📖 RESOURCES:
  • Docs: https://docs.crawl4ai.com/
  • GitHub: https://github.com/unclecode/crawl4ai
  • Discord: https://discord.gg/jP8KfhDhyN

🚀 Happy Crawling with Crawl4AI!
""")
```
The tutorial stitches together a full Crawl4AI pipeline, showing that modern crawlers can do more than fetch raw HTML. By configuring a headless browser, the series runs JavaScript, captures screenshots, and extracts content using CSS selectors that feed a JSON extraction schema. Markdown generation turns the raw pages into readable reports, while session handling and link analysis enable deeper, multi-page traversals.
Concurrent crawling demonstrates that the framework can scale across several URLs, and the Hacker News scraper illustrates a real‑world use case with filtering logic built into the extraction step. However, the guide stops short of benchmarking performance or assessing robustness under heavy traffic, leaving open questions about latency and resource consumption. It's unclear whether the same setup would handle sites with aggressive anti‑scraping measures without additional tweaks.
Can this approach survive real‑world load? Overall, the material provides a concrete reference for developers interested in extending Crawl4AI, but practical deployment will likely require further testing and adaptation.
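The filtering step in the example leans on Crawl4AI's `PruningContentFilter`, whose actual heuristics live inside the library. As a rough intuition for what a threshold like 0.4 means, here is a stdlib-only sketch of threshold-based pruning; the scoring rule and sample blocks are invented for illustration, not the library's real logic:

```python
# PruningContentFilter's real scoring is internal to Crawl4AI; this sketch only
# illustrates the general idea: score each text block, then drop blocks that
# fall below a cutoff (the tutorial passes threshold=0.4).
def block_score(text: str, link_chars: int = 0) -> float:
    """Toy score: long blocks with few link characters look like content;
    short, link-heavy blocks look like navigation chrome."""
    if not text:
        return 0.0
    link_density = link_chars / len(text)
    length_signal = min(len(text) / 200, 1.0)  # saturates at ~200 chars
    return length_signal * (1.0 - link_density)

def prune(blocks, threshold=0.4):
    """Keep only (text, link_chars) blocks scoring at or above the threshold."""
    return [text for text, link_chars in blocks
            if block_score(text, link_chars) >= threshold]

blocks = [
    ("Home | About | Login | Sign up", 28),   # nav bar: almost all link text
    ("Crawl4AI stitches CSS extraction, filtering and schemas "
     "into one pipeline, so scraped pages come back as structured "
     "records instead of raw HTML that still needs cleaning by hand.", 0),
    ("© 2024", 0),                            # footer stub: too short to matter
]
print(prune(blocks))  # only the long, link-free paragraph survives
```

The same shape of decision (a continuous score compared against one tunable cutoff) is why lowering the threshold keeps more boilerplate and raising it risks dropping short but genuine content.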
Further Reading
- AI Web Scraping Without Limits—Scrape Anything using Crawl4AI - Dev.to
- Crawl4AI in Action: Real-World Use Cases for Smarter Web Scraping - mcavdar.com
- Crawl4AI - a hands-on guide to AI-friendly web crawling - ScrapingBee
- Content Selection - Crawl4AI Documentation (v0.8.x)
Common Questions Answered
How does Crawl4AI extract content from dynamic websites like Hacker News?
Crawl4AI uses CSS-based element selection to precisely target specific HTML elements on dynamic pages. The framework configures a headless browser to run JavaScript and capture content, allowing it to extract data from frequently changing sites like Hacker News.
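Crawl4AI's `JsonCssExtractionStrategy` resolves real CSS selectors inside a rendered page. As a stdlib-only illustration of the underlying selector-to-field idea (the sample HTML and the simplified tag/class matching are invented here, and this is not how the library is implemented), here is a tiny extractor run over one static Hacker News-style row:

```python
# Stdlib sketch of the selector-to-field mapping behind JsonCssExtractionStrategy.
# No Crawl4AI, no browser, no real CSS engine: we walk HTML with html.parser and
# record text for elements whose tag/class match a simplified field spec.
from html.parser import HTMLParser

SAMPLE = """
<tr class="athing">
  <td><span class="rank">1.</span></td>
  <td><span class="titleline"><a href="https://example.com/post">Show HN: A tiny parser</a></span>
      <span class="sitestr">example.com</span></td>
</tr>
"""

# Simplified "schema": field name -> (tag, required class) that holds its text.
FIELDS = {
    "rank": ("span", "rank"),
    "title": ("a", None),       # the story link inside span.titleline
    "site": ("span", "sitestr"),
}

class FieldExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.result = {}
        self._current = None  # field name currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = attrs.get("class", "").split()
        for name, (want_tag, want_class) in FIELDS.items():
            if tag == want_tag and (want_class is None or want_class in classes):
                self._current = name
                if name == "title":          # the link doubles as the URL field
                    self.result["url"] = attrs.get("href", "")

    def handle_data(self, data):
        if self._current and data.strip():
            self.result[self._current] = data.strip()
            self._current = None

parser = FieldExtractor()
parser.feed(SAMPLE)
print(parser.result)
```

The real strategy adds what this sketch deliberately omits: full CSS combinators (e.g. `span.titleline > a`), one record per `baseSelector` match, and operation on the browser-rendered DOM rather than static markup.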
What key components are included in the Crawl4AI scraping pipeline?
The Crawl4AI pipeline combines three core capabilities: CSS-based element selection, content filtering, and a structured schema for labeling extracted data. This approach allows for precise content extraction, filtering of relevant information, and structured output that can be easily processed by language models.
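One way to picture those three capabilities is as three composable stages. The sketch below is purely illustrative (the helper functions, sample rows and simplified schema are invented, not Crawl4AI APIs); in the real pipeline, `CrawlerRunConfig` wires the equivalent strategies together:

```python
# Three stages of the pipeline as plain functions: select -> filter -> label.
SCHEMA = {"fields": ["rank", "title", "site"]}

def select(raw_rows):
    """Stage 1 stand-in for CSS selection: keep rows that look like stories."""
    return [r for r in raw_rows if r.get("kind") == "story"]

def content_filter(rows, min_title_len=10):
    """Stage 2 stand-in for filtering: drop rows with too little content."""
    return [r for r in rows if len(r.get("title", "")) >= min_title_len]

def label(rows, schema):
    """Stage 3 stand-in for schema labeling: emit only the named fields."""
    return [{k: r.get(k) for k in schema["fields"]} for r in rows]

raw = [
    {"kind": "story", "rank": "1.", "title": "A long enough headline",
     "site": "example.com", "junk": "x"},
    {"kind": "ad", "title": "Buy now"},
    {"kind": "story", "rank": "2.", "title": "Hi", "site": "example.org"},
]
records = label(content_filter(select(raw)), SCHEMA)
print(records)
# → [{'rank': '1.', 'title': 'A long enough headline', 'site': 'example.com'}]
```

The design point carries over to the real framework: because each stage consumes and produces plain records, any one of them can be tuned or swapped without touching the other two.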
What makes the Hacker News scraping example unique in the Crawl4AI tutorial?
The Hacker News example demonstrates Crawl4AI's ability to handle dynamic, frequently changing web content with complex markup. By using specific CSS selectors like 'tr.athing' and extracting fields such as rank and title, the tutorial shows how the framework can reliably extract structured data from challenging web sources.