100 Days Building a Security Scanner: What Scanning Real Repos Taught Me
After 100 days of building Data Hogo and scanning real repositories, here are the patterns, surprises, and security lessons that changed how I think about code safety.
Rod
Founder & Developer
One hundred days ago, I shipped the first working version of Data Hogo's security scanner. It was rough — it could detect hardcoded API keys and a handful of dependency issues. Today the scanner runs 250+ Semgrep rules across six parallel engines, and it has processed enough real repositories that the patterns are clear. This post is what those patterns actually look like: the lessons from watching hundreds of developers' code run through analysis, unfiltered.
I'm sharing this partly because building in public is something I believe in, and partly because the data is genuinely interesting. Some findings confirmed what I expected. Others surprised me.
The Numbers: What the Data Shows
Let me be specific. Vague claims about "most repos have issues" don't help anyone.
Across repos scanned with Data Hogo during this period:
- ~68% of repos with any findings had at least one exposed secret or API key. This is the number that keeps me up at night. More than two-thirds.
- Security headers were missing or misconfigured in over 90% of deployed apps. This one I expected — headers are invisible until something breaks.
- Dependency vulnerabilities appeared in roughly 55% of scans. Usually medium severity, occasionally critical.
- Missing or broken authentication checks appeared in about 30% of repos — but when they appeared, they tended to appear on routes that mattered. API endpoints, admin pages, payment flows.
- Average security score on first scan: 43 out of 100. After fixing the top findings, the average jumps to around 71.
The 43-to-71 jump matters. It means most of the risk is concentrated in a small number of fixable issues. You don't have to achieve a perfect score. You have to eliminate the findings that cause real damage.
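One way to see why the jump is so large: in a toy severity-weighted scoring model (entirely my own illustration, not Data Hogo's actual formula), a couple of critical findings dominate the penalty, so fixing just those moves the score most of the way.

```python
# Toy severity-weighted scoring model. Illustrative only: this is NOT
# Data Hogo's real formula, just a sketch of why risk concentrates.
PENALTY = {"critical": 15, "high": 8, "medium": 3, "low": 1}

def score(findings):
    """Start at 100 and subtract a severity-weighted penalty, floored at 0."""
    total = sum(PENALTY[sev] for sev in findings)
    return max(0, 100 - total)

# A typical first scan: two criticals account for more than half the penalty.
first_scan = ["critical", "critical", "high", "high",
              "medium", "medium", "medium", "low", "low"]
print(score(first_scan))  # 43

# Fix only the two criticals and the score jumps into the 70s.
after_fixes = ["high", "high", "medium", "medium", "medium", "low", "low"]
print(score(after_fixes))  # 73
```

The exact weights are made up, but the shape matches the data: a handful of fixable findings carry most of the risk.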
What I Got Wrong at the Start
Building a security tool comes with a bias: you assume developers are more security-aware than they are. I was wrong about this in the direction of optimism.
I expected secrets to be the rare case — the developer who just forgot. Instead, it's the default outcome when someone builds fast without thinking about it. Cursor generates a Stripe integration. The key goes in the code. The code gets committed. Done.
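For illustration, here is roughly what catching that pattern looks like. This is a deliberately simplified sketch of one secret-detection rule; the regex and function are mine, not the scanner's actual ruleset, and real rulesets cover hundreds of key formats.

```python
import re

# Simplified pattern for Stripe-style secret keys: sk_live_ or sk_test_
# followed by an alphanumeric body. Real rules are broader and entropy-aware.
STRIPE_KEY = re.compile(r"\bsk_(?:live|test)_[0-9a-zA-Z]{16,}\b")

def find_hardcoded_keys(source: str):
    """Return (line_number, match) pairs for anything that looks like a key."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for m in STRIPE_KEY.finditer(line):
            hits.append((lineno, m.group(0)))
    return hits

code = 'const stripe = new Stripe("sk_live_abcdefghijklmnop1234");'
print(find_hardcoded_keys(code))  # [(1, 'sk_live_abcdefghijklmnop1234')]
```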
I also expected that once developers knew about an issue, they'd fix it immediately. The data doesn't support this. Scan results show findings that have been sitting in repos for months. Not because developers are careless — because they're busy, and "rotate the API key" doesn't feel urgent until it's a problem.
What actually motivates fixes is specificity. Not "you have a high severity issue" but "line 47 in src/lib/stripe.ts has a live Stripe key hardcoded." The more specific the finding, the faster it gets fixed. This shaped how Data Hogo presents results: exact file, exact line, exact explanation of why it matters.
The finding that says "you have 3 critical issues" doesn't get fixed. The finding that says "your database password is on line 12 of your .env.production file which is not in your .gitignore" gets fixed within hours.
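Specificity can be designed into the finding itself. A sketch of what such a finding record might carry (the field names and rendering are my invention, not Data Hogo's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    severity: str  # "critical", "high", "medium", "low", "info"
    path: str      # exact file
    line: int      # exact line
    message: str   # plain-English explanation of the actual risk

    def render(self) -> str:
        # Compiler-style "path:line" framing is what gets findings fixed fast.
        return f"{self.path}:{self.line} [{self.severity.upper()}] {self.message}"

f = Finding("critical", "src/lib/stripe.ts", 47,
            "Live Stripe key hardcoded; rotate it and load it from the environment.")
print(f.render())
```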
The Vibe Coding Effect
This was the biggest shift I didn't fully anticipate when I started.
By the time I was three months into building the scanner, it was obvious that AI-generated code was creating a distinct pattern of vulnerabilities. Not new vulnerability classes — the same categories that have always existed. But at higher density and with less review.
When a developer writes code manually, there's a built-in review process: they read what they're writing. When an AI writes code and the developer accepts it, that review often doesn't happen. The result is that patterns that would normally catch a developer's eye — an API key sitting in a string, a route with no auth check — slip through.
Veracode's 2025 State of Software Security report put the number at 45% of AI-generated code containing at least one vulnerability. That's consistent with what we observe. The repos that score worst on first scan are almost always repos that were built primarily with AI assistance and moved fast.
This isn't an argument against AI coding tools. It's an argument for building a review step into your AI-assisted workflow. Running a security scan after a major Cursor or Copilot session is the equivalent of running the test suite: it's the check that catches what you didn't catch in the moment.
The Surprising Findings
Secrets in Unexpected Places
The obvious place for a hardcoded secret is in a source file — a string literal in a config or initialization block. Those we catch easily. What surprised me was how often secrets appear in:
- Test files. A developer writes an integration test using a real API key to make it actually work. The test passes. The test gets committed. The key stays there forever.
- Configuration templates. A config.example.ts that was supposed to use placeholder values but uses real ones instead.
- Old branches that got merged. The secret was removed from main, but it exists in the Git history. git log remembers everything.
The Git history case is the one that catches developers most off-guard. You can delete a file from your current branch. The file and everything in it still exists in every commit that included it. If that commit is in a public repo, the secret is public — even if the file is gone.
```shell
# This removes the file from your current branch
git rm --cached secrets.txt

# The secret is still accessible in history
git log --all --full-history -- secrets.txt
git show <commit-hash>:secrets.txt
```

The full recovery process for an exposed key involves rotating the credential, then using a tool like git-filter-repo to scrub the history, then force-pushing. The complete guide to fixing an exposed API key walks through every step.
Small Projects Have Worse Security Than Large Ones
This seems obvious in retrospect, but the magnitude surprised me. Small repos — under 10,000 lines, solo developer or small team, built quickly — score significantly lower on first scan than larger projects.
The reason isn't that solo developers are less skilled. It's that they don't have the process overhead that enforces security: code review, security-focused PRs, automated CI gates. A startup moving fast skips all of that by design. The tradeoff is speed now for security debt later.
The good news: small projects are also easier to fix. The vulnerability surface is smaller. A solo developer can go from first scan to all critical issues resolved in a day.
Security Headers Are Almost Universally Missing
I knew this going in — it's a known problem. What I didn't expect was the percentage. Over 90% of deployed apps we scanned were missing at least one important security header. Many were missing all of them.
The headers that protect against clickjacking, MIME type sniffing, cross-site scripting via content injection, and protocol downgrade attacks are all one-time configuration changes. They don't require ongoing maintenance. They're just not there because no one went looking.
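The check itself is mechanical. Here is a minimal sketch that flags which of the standard hardening headers a response is missing; the header-to-risk mapping is standard, while the function shape is my own:

```python
# Standard hardening headers and the attack class each one addresses.
REQUIRED_HEADERS = {
    "Strict-Transport-Security": "protocol downgrade",
    "X-Frame-Options": "clickjacking",
    "X-Content-Type-Options": "MIME type sniffing",
    "Content-Security-Policy": "content injection / XSS",
}

def missing_headers(response_headers: dict) -> dict:
    """Return {header: risk} for each required header the response lacks.

    Header names are case-insensitive per the HTTP spec, so compare lowercased.
    """
    present = {name.lower() for name in response_headers}
    return {header: risk for header, risk in REQUIRED_HEADERS.items()
            if header.lower() not in present}

# A typical default deployment sends none of them.
print(missing_headers({"Content-Type": "text/html"}))
```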
You can check your security headers right now — paste your URL and see which ones are missing. The fix for most of them in a Next.js app is a dozen lines in next.config.ts. The complete Next.js security headers guide covers exactly what to add and why each one matters.
What Building a Security Tool Taught Me About Security
False Positives Are a Trust Tax
Every false positive erodes trust in the tool. If a developer sees a "critical" finding that turns out to be a test file with a fake API key, they'll dismiss the next three real findings out of fatigue. We spent a significant amount of time tuning rules to reduce false positives — specifically around test files, example configs, and template code.
The tradeoff is that a lower false positive rate sometimes means a slightly higher false negative rate. We chose that tradeoff deliberately. A missed finding is a miss. A false positive that causes a developer to ignore all findings is a category failure.
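The cheapest tuning lever here is path context: the same matched string inside a test file or example config is far less likely to be a live credential. A sketch of that heuristic, with illustrative path patterns of my own choosing rather than the scanner's real rules:

```python
from fnmatch import fnmatch

# Path patterns where a "secret" is usually a fixture, not a live key.
# Illustrative list; real tuning involves many more signals than the path.
LOW_TRUST_PATHS = ["*test*", "*spec*", "*.example.*", "*fixtures*", "*template*"]

def adjust_severity(path: str, severity: str) -> str:
    """Downgrade, rather than drop, findings that live in test/example code."""
    if any(fnmatch(path.lower(), pat) for pat in LOW_TRUST_PATHS):
        return "info" if severity in ("critical", "high") else severity
    return severity

print(adjust_severity("src/lib/stripe.ts", "critical"))     # critical
print(adjust_severity("tests/stripe.spec.ts", "critical"))  # info
```

Downgrading instead of suppressing matters: the finding stays visible, but it no longer screams "critical" at a fake key and burns the developer's trust.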
Severity Labels Are Doing a Lot of Work
Early versions of the scanner labeled everything with a CVSS score. Nobody knew what to do with "CVSS 8.1." We switched to plain severity labels (Critical, High, Medium, Low, Info) paired with a one-sentence plain-English description of the actual risk.
"Your Stripe live secret key is committed to this file. Anyone who reads this file can charge cards on your account." That's more actionable than a number.
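The score-to-label translation itself is mechanical. The thresholds below are the standard CVSS v3.1 qualitative severity ratings; pairing each label with a plain-English risk sentence is the part that took design work.

```python
def label(cvss: float) -> str:
    """Map a CVSS v3.1 base score to a plain severity label.

    Thresholds follow the standard CVSS v3.1 qualitative rating scale.
    """
    if cvss >= 9.0:
        return "Critical"
    if cvss >= 7.0:
        return "High"
    if cvss >= 4.0:
        return "Medium"
    if cvss > 0.0:
        return "Low"
    return "Info"

print(label(8.1))  # High: clearer to most developers than "CVSS 8.1"
```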
The First Scan Is a Moment of Truth
The most valuable UX insight from building this: the first scan result is the moment a developer either becomes a user or leaves forever. If the first scan shows zero findings, it's not credible — and the developer doesn't trust the tool. If it shows 47 findings and calls all of them critical, it's overwhelming and the developer gives up. The right first scan result is: here are your 2-3 actual critical issues, here are the medium ones you should know about, and here's your score compared to similar projects.
That's what we built toward. The goal is to make the developer feel like they're looking at an honest assessment from a knowledgeable friend, not a wall of warnings from a liability-covering enterprise tool.
What's Next
The scanner covers JavaScript/TypeScript, Python, Go, Java, PHP, C#, and mobile (React Native, Expo, Flutter) today. The next frontier is infrastructure: cloud config files, Terraform, Kubernetes manifests, IAM policies.
The other direction I'm exploring: integrating the scanner directly into the development loop, not just as a periodic audit. If you're running a Cursor session and shipping a new route, the scan should happen automatically — not wait for you to remember to run it manually.
For now, the free tier is the starting point: 3 scans per month, no credit card required. See your score. See your findings. Then decide what to fix first.
If you want to see exactly what to do after your scan results come in, the post-scan guide covers prioritization — what to fix immediately, what to schedule, and what can wait.
Frequently Asked Questions
What is the most common security vulnerability found in developer repos?
Hardcoded secrets and exposed API keys appear in roughly 60-70% of repositories that have any findings at all. Missing security headers come in second, appearing in nearly every deployed application. Dependency vulnerabilities are third, appearing in about 55% of scans.
How long does it take to build a security scanner?
The core scanning engine for a single language and vulnerability class takes a few weeks. Full coverage — secrets, dependencies, code patterns, configuration, URL headers, database rules — takes months of tuning. The hard part isn't detection. It's reducing false positives enough that developers trust the results.
Are developers getting better at security over time?
The overall trend is positive. Frameworks have more secure defaults, major platforms add security guardrails, and awareness has increased. But AI-generated code has introduced a regression: tools that optimize for working code generate insecure patterns at scale. The volume of new vulnerabilities is keeping pace with improvements in security awareness.
What security checks matter most for a small project?
In order of impact: rotate any exposed secrets immediately, update dependencies with known CVEs, add authentication checks to all routes handling user data, and set security headers on your deployed app. Those four categories account for most exploitable vulnerabilities in small projects.
Should I scan my repo even if it's just a side project?
Yes — especially if it has real users, processes payments, or stores user data. Side projects have the worst security track record precisely because they get built fast without a review process. A free scan takes under 5 minutes.