GRC directors and security architects in regulated industries know the rhythm. Audit season rolls around. Evidence requests land on desks. A sample gets pulled. A clean report ships. The badge goes on the website. And everyone moves forward, trusting the process.
But the process has a math problem. And the gap between what the audit actually verified and what the certification claims to cover keeps growing, especially as AI-assisted development compresses sprint cycles and multiplies the volume of changes flowing through your environment every week.
In this post we will cover:
- How Does Sampling Work in a SOC 2 Audit?
- Where the Logic Breaks Down for Security
- Pen Testing Has the Same Blind Spot
- Why Continuous Compliance Doesn’t Fix the Gap
- The Speed Problem Makes Everything Worse
- What Change-Native Governance Gets Right
- Make the Unverified Surface Visible
How Does Sampling Work in a SOC 2 Audit?
Sampling isn’t a shortcut. It’s a formal methodology with real statistical foundations, and auditors apply it carefully.
In SOC 1 and SOC 2 examinations, auditors use one of four primary sampling methods: simple random sampling, systematic sampling, haphazard sampling, or block sampling. The AICPA’s Audit Sampling Guide provides recommended sample sizes based on population size, control risk, and acceptable deviation rates. As Linford & Company explains in their overview of audit sampling methods, the tables aim for a minimum 90% confidence level with allowances for up to two deviations.
For a population of 250 or more items (for example, the number of new hires, access requests, or pull requests during the reporting period), the AICPA tables typically recommend a sample of around 25 to 40 items. The auditor examines each sampled item, tests whether the control operated as designed, and draws a conclusion about the full population based on the results.
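To make the selection methods concrete, here is a minimal Python sketch of three of them. The function names, the seed, and the population of 36,000 numbered change IDs are illustrative assumptions, not anything prescribed by the AICPA guide. Haphazard selection is an auditor’s unstructured pick with no formal procedure, so it is left out on purpose.

```python
import random

def simple_random(population, n, rng):
    """Every item has an equal chance of being selected."""
    return rng.sample(population, n)

def systematic(population, n, rng):
    """Pick a random start, then take every k-th item after it."""
    k = len(population) // n          # sampling interval (assumes n <= len(population))
    start = rng.randrange(k)          # random offset inside the first interval
    return [population[start + i * k] for i in range(n)]

def block(population, n, rng):
    """Take n contiguous items starting at a random position."""
    start = rng.randrange(len(population) - n + 1)
    return population[start:start + n]

rng = random.Random(7)                # fixed seed so a rerun selects the same items
changes = list(range(1, 36_001))      # hypothetical change IDs 1..36000

for method in (simple_random, systematic, block):
    picked = method(changes, 25, rng)
    print(f"{method.__name__:>14}: {len(picked)} items, "
          f"lowest ID {min(picked)}, highest ID {max(picked)}")
```

Comparing the lowest and highest IDs shows how differently the methods spread across the population: a block sample covers 25 consecutive items, while the other two reach across the whole year.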
The logic is straightforward. If you pull a well-designed sample and find zero deviations, you can reasonably conclude the control operated effectively across the population. Auditors do this responsibly, following AU-C 530 guidance on sample design, size, and selection.
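A rough way to see what a clean sample does and does not establish is to ask how high the true deviation rate could be while still producing zero deviations. The sketch below assumes independent draws from a large population (a simple binomial model). It is not how the AICPA tables are derived, and `max_deviation_rate` is a name made up for illustration.

```python
def max_deviation_rate(sample_size, confidence):
    """Largest deviation rate still consistent with a clean sample.

    Solves (1 - p) ** sample_size = 1 - confidence for p, assuming
    independent draws from a large population (binomial model).
    """
    return 1 - (1 - confidence) ** (1 / sample_size)

for n in (25, 40):
    bound = max_deviation_rate(n, 0.90)
    print(f"n={n}: a clean sample rules out deviation rates above {bound:.1%}")
```

Under these assumptions the bound works out to roughly 8.8% for 25 items and 5.6% for 40. A clean sample says the control probably fails less often than that. It does not say the control never fails.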
And for many types of controls, the methodology holds up well. Annual controls get tested at 100% (sample of one). Automated controls often require only a single reperformance test, because the control executes identically every time. The sampling challenge shows up most acutely with manual, frequently occurring controls, and that’s exactly the category where modern development velocity creates the biggest strain.
Where the Logic Breaks Down for Security
Sampling works best when the population being tested behaves consistently. In database marketing, Jim Novo describes the mechanics of random sampling for testing customer promotions: you select a random percentage of your target group, run the test, and extrapolate the results. The method works because customer behavior within a well-defined segment follows predictable patterns. A 3% sample gives you a reasonable read. A 10% sample gives you strong predictive stability. And the key requirement, as Novo emphasizes, is that the sample must be truly random and free from bias, or the results become unreliable.
But security controls don’t behave like customer segments or marketing campaigns. A firewall rule operates nothing like an access review, which operates nothing like an encryption configuration. Each control type has unique failure modes, dependencies, and environmental context. When an auditor samples 25 pull requests out of 36,000 and treats the result as representative of the whole, the underlying assumption (that those 25 items are statistically similar to the other 35,975) doesn’t hold.
As Ayoub Fandi put it in a recent GRC Engineer Newsletter, the methodology was borrowed from domains where population homogeneity makes sampling valid. Security environments violate that requirement completely. An attacker doesn’t need a pattern across your controls. They need one misconfigured resource, one unpatched dependency, or one forgotten test environment with production credentials. None of which will show up in a sample of 25 unless you get lucky. According to Ayoub:
“Let me put concrete numbers on this. Say your engineering team ships code 100 times a day. Over a 12-month SOC 2 reporting period, that is roughly 36,000 pull requests. Your auditor samples 25 of them.
That is a sample rate of 0.07%.
The same 25 gets pulled whether the population is 200 or 200,000. The joiners, movers, and leavers process? Sampled. Specific endpoints? Sampled. Merge requests? Sampled. The number 25 does not change. The population it represents does.
In statistical science, a sample of 25 from a population of 36,000 is not statistically significant for heterogeneous populations. But we certify the whole company against it.”
And the certification doesn’t say “we’re 90% confident that 25 of your controls worked.” The certification implies comprehensive assurance across your entire scoped environment. Your customers see the badge. They don’t see the sampling memo.
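The arithmetic behind the quote can be checked directly, along with the odds that a sample of 25 happens to include one of a handful of bad changes. This is a sketch that assumes simple random sampling without replacement; the counts of bad changes are hypothetical inputs chosen for illustration.

```python
from math import comb

def sample_rate(sample_size, population):
    """Share of the population that the sample actually touches."""
    return sample_size / population

def chance_of_catching(bad_items, sample_size, population):
    """Probability that a simple random sample holds at least one bad item."""
    # Ways to draw a sample made up only of good items, over all possible samples.
    miss = comb(population - bad_items, sample_size) / comb(population, sample_size)
    return 1 - miss

population, sample_size = 36_000, 25
print(f"sample rate: {sample_rate(sample_size, population):.2%}")
for bad in (1, 10, 100):          # hypothetical numbers of flawed changes
    p = chance_of_catching(bad, sample_size, population)
    print(f"{bad:>3} bad changes -> {p:.2%} chance the sample includes one")
```

With a single flawed change, the chance of catching it equals the sample rate itself, about 0.07%. Even with ten flawed changes hidden in the population, the odds stay below 1%.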
Pen Testing Has the Same Blind Spot
Auditing isn’t the only security discipline that runs on partial coverage. Penetration testing operates on a similar principle, and it creates a similar false confidence.
Every pen test starts with a scoping exercise. Budgets, timelines, and team availability determine how many IP addresses, applications, or network segments get tested. The scope is always a fraction of the full environment.
A pen tester working through 800 IP addresses in a two-week engagement can’t go deep on every single one. Some get thorough testing. Others get a surface scan. And anything outside the agreed scope doesn’t get touched at all, even if it’s connected to systems that were tested.
The difference between pen testing and GRC auditing is that pen testers typically acknowledge the limitation upfront. Nobody in the pen testing community pretends that a scoped engagement equals full coverage. The report says “we tested X within Y boundaries,” and that’s understood as a constraint.
GRC auditing doesn’t operate with that same transparency. The output of a sample-based audit is a certification that implies comprehensive assurance. Your customers don’t see the sampling memo. They see the badge.
Both approaches share a structural weakness: partial observation gets treated as sufficient evidence of security. But only GRC stamps a certification on the result and calls it done.
Why Continuous Compliance Doesn’t Fix the Gap
Modern GRC platforms like Vanta and Drata have made real progress on one piece of the puzzle. They connect to your cloud infrastructure, identity providers, and endpoint management tools to monitor control configurations continuously. If your S3 buckets lose encryption, or your MFA policy changes, or an access review falls behind schedule, the platform flags it.
And that’s genuinely valuable for configuration-level compliance. Controls that emit telemetry (like infrastructure settings and access policies) can be monitored at full population, all the time. No sampling needed.
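Full-population monitoring of a configuration control is conceptually simple: check every resource rather than a handful. The sketch below uses a hypothetical in-memory inventory and a made-up `Bucket` type. It does not call any real cloud API and is not how Vanta or Drata are implemented.

```python
from dataclasses import dataclass

@dataclass
class Bucket:
    name: str
    encrypted: bool

def failing_buckets(inventory):
    """Check every bucket, not a sample, and return the ones out of policy."""
    return [b.name for b in inventory if not b.encrypted]

# Hypothetical inventory; a real platform would pull this from cloud APIs.
inventory = [
    Bucket("billing-exports", True),
    Bucket("ml-scratch", False),
    Bucket("audit-logs", True),
]
print(failing_buckets(inventory))  # → ['ml-scratch']
```

Because the check is cheap and the state is machine-readable, there is no reason to sample here. The harder problem, covered next, is the work that never emits this kind of telemetry.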
But continuous compliance platforms monitor whether controls exist and are configured correctly. They answer the question: “Are we compliant right now?”
They don’t answer a different question, and it’s the one that matters most for security leaders managing fast-moving development organizations: “Will the change my team shipped this morning make us non-compliant tomorrow?”
When an engineering team introduces a new feature that collects PII, modifies authentication logic, or integrates a third-party API, the risk isn’t about whether existing controls are configured correctly. The risk is about whether anyone assessed the new change for security, privacy, and compliance implications before it hit production.
Continuous compliance platforms can’t read a Jira ticket. They can’t parse a design document. They can’t analyze a pull request and determine that a developer modified token validation logic in a way that bypasses an existing security control. They don’t correlate a design doc, a set of tickets, and a group of PRs into a single initiative and assess the aggregate risk.
So the sampling problem persists, even with continuous monitoring in place. Your infrastructure controls are watched around the clock. But the thousands of code changes, architecture decisions, and vendor integrations flowing through your development lifecycle every month still get reviewed (if at all) through the same manual, sample-based process they always did.
The outer ring of your attack surface, the one created by new changes, shadow IT, and cloud resources provisioned without GRC involvement, keeps growing. And neither sampling nor continuous monitoring covers it.
The Speed Problem Makes Everything Worse
The math gets more uncomfortable when you factor in how fast the threat landscape moves.
According to data tracked by zerodayclock.com (compiled by Sergej Epp from CISA KEV, VulnCheck KEV, and XDB sources covering 3,520 CVE-exploit pairs), the mean time-to-exploit for vulnerabilities has collapsed from 2.3 years in 2018 to 1.3 days in 2026. Nearly 70% of exploited vulnerabilities in 2026 were weaponized on or before the day they were publicly disclosed.
Your quarterly audit observation window was designed for a world where threats moved in months. Exploits now move in hours.
And development velocity has accelerated in parallel. AI coding tools like Copilot and Cursor have compressed sprint cycles and multiplied PR volume. Security teams that were already struggling with 1:100 security-to-developer ratios are now facing backlogs of 50 to 100+ pending reviews, with 7-day average cycle times for security assessments.
When your auditor samples 25 items from a population that grew by 30% since last year’s audit, and the threats targeting that population move at machine speed, the gap between “certified” and “secure” widens into territory that no confidence interval can cover.
What Change-Native Governance Gets Right
Fixing the sampling problem doesn’t mean taking bigger samples. Sampling 50 instead of 25 barely moves the needle when the population is 36,000 and growing.
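A quick calculation shows why doubling the sample does so little. The sketch below compares the coverage of 25 and 50 items, then estimates how many items a sample would need for a 90% chance of hitting at least one of ten flawed changes. It assumes independent draws (a with-replacement approximation), and the figure of ten flawed changes is hypothetical.

```python
from math import ceil, log

def sample_needed(bad_items, population, target):
    """Smallest sample giving `target` odds of hitting at least one bad item.

    Uses the independent-draw approximation: miss_probability ** n <= 1 - target.
    """
    per_draw_miss = 1 - bad_items / population
    return ceil(log(1 - target) / log(per_draw_miss))

population = 36_000
for n in (25, 50):
    print(f"sample of {n}: {n / population:.2%} of the population")

needed = sample_needed(bad_items=10, population=population, target=0.90)
print(f"items needed for 90% odds of catching 1 of 10 bad changes: {needed}")
```

Under these assumptions the required sample runs to several thousand items, which is no longer sampling in any practical sense. That is the argument for assessing every change instead.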
The alternative is building governance into the process where changes actually happen. Instead of reviewing a sample of pull requests after the fact, you build a system that assesses every change for security, privacy, and compliance risk as it occurs.
Gist Security built its platform around exactly this model. The system connects to Jira, ServiceNow, GitHub, GitLab, Confluence, and other tools across the development stack. It automatically scores every change request on five dimensions: whether security should review it, whether security should approve it, whether work should be suspended, relevance, and urgency. It then tests each change against uploaded company policies and industry frameworks like NIST, OWASP, GDPR, and HIPAA.
For high-risk changes (authentication modifications, new PII handling, vendor integrations), Gist runs automated threat modeling and architecture review, produces findings and required mitigations, and routes decisions to the appropriate approver. Every risk decision, approval condition, exception, and proof of control gets captured into a Decision and Evidence Ledger that generates audit-ready packages on demand.
The difference from traditional GRC is fundamental. Evidence doesn’t get collected retroactively during audit prep. It comes from real work artifacts as they’re created. The question “who approved this design, and why?” has an answer before the auditor asks it.
For GRC directors tired of annual evidence scrambles and security architects drowning in review backlogs, the shift means the auditable perimeter actually matches the real environment, not a curated subset of it.
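To picture what per-change evidence capture might look like, here is a purely illustrative sketch of a record carrying the five dimensions described above, appended to a simple ledger. The class name, field names, 0 to 5 scales, and the `PR-1042` identifier are all assumptions made for this example. This is not Gist Security’s schema or API.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class ChangeAssessment:
    change_id: str          # hypothetical ticket or pull request reference
    needs_review: bool      # should security review it?
    needs_approval: bool    # should security approve it?
    suspend_work: bool      # should work pause until it is reviewed?
    relevance: int          # hypothetical 0-5 scale
    urgency: int            # hypothetical 0-5 scale

def record(ledger, assessment, decision, approver):
    """Append one decision, with a UTC timestamp, to the evidence list."""
    entry = {
        **asdict(assessment),
        "decision": decision,
        "approver": approver,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    ledger.append(entry)
    return entry

ledger = []
record(
    ledger,
    ChangeAssessment("PR-1042", True, True, False, relevance=4, urgency=3),
    decision="approved with conditions",
    approver="security-architect",
)
print(json.dumps(ledger, indent=2))
```

The point of the sketch is the timing: the entry is written when the decision is made, so the evidence exists before anyone asks for it.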
Make the Unverified Surface Visible
If you’re a security leader reading this and recognizing the gap in your own program, the first move is making the unverified surface visible to your leadership team.
Ask your auditor for the sample selection details. Get the actual numbers: how many items were tested out of how many total in the population. Map your real control population (every system, service, cloud resource, and code change) against what was actually in scope. Calculate the delta and present it as residual unverified risk. Because that’s exactly what it is, and no badge covers it.
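The delta calculation is simple enough to script. The sketch below assumes you have a tested count and a population count per control category. The categories and the access review figures are hypothetical placeholders, so swap in the numbers from your own sampling memo.

```python
def unverified_share(tested, population):
    """Fraction of the population the audit never examined."""
    if population == 0 or not 0 <= tested <= population:
        raise ValueError("tested must be between 0 and a non-zero population")
    return 1 - tested / population

# Hypothetical figures; replace with the numbers from your sampling memo.
controls = {
    "pull requests": (25, 36_000),
    "access reviews": (25, 400),
}
for name, (tested, population) in controls.items():
    share = unverified_share(tested, population)
    print(f"{name}: {tested} of {population} tested, {share:.2%} unverified")
```

Listing the result per control category, rather than as one blended number, shows leadership where the sample is thinnest relative to the volume of activity.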
Then pick one control category and move it from sampling to continuous, full-population verification. See what you find when the inner ring of verified coverage begins expanding toward the outer ring of your actual attack surface. The tooling exists. The economics work. Your auditor checked a fraction of a percent. Your attacker will check the rest.
FAQ
How does sampling work in a SOC 2 audit?
Audit sampling is a methodology where auditors test a subset of the total control population and extrapolate conclusions about the whole. The AICPA Audit Sampling Guide recommends specific sample sizes based on population, risk, and acceptable deviations. For populations of 250 or more, a sample of 25 to 40 items is typical. While the methodology follows valid statistical principles, it means the vast majority of a company’s control activity goes unexamined.
Why does the sampling logic break down for security?
Sampling relies on the assumption that the population being tested behaves consistently, so a representative sample tells you about the whole. Security controls are fundamentally heterogeneous. A firewall rule operates nothing like an access review or an encryption configuration. Each has unique failure modes, and an attacker only needs one unexamined gap. The statistical assumption of population consistency doesn’t hold in security environments.
Does pen testing have the same blind spot?
Pen testing also operates on scoped, partial coverage driven by budgets and timelines. The key difference is transparency. Pen testers acknowledge coverage limits in their reports. GRC auditing issues certifications that imply comprehensive assurance even when only a fraction of the environment was examined. Gist Security addresses both gaps by continuously monitoring every change across the full development lifecycle, rather than relying on periodic, partial observation.
Why doesn’t continuous compliance fix the gap?
Continuous compliance platforms monitor whether existing controls are configured correctly, like checking that S3 buckets stay encrypted or MFA policies remain active. But they can’t assess whether a new code change, architecture decision, or vendor integration introduces risk. The sampling gap for code changes, design decisions, and IT initiatives persists because those platforms can’t read Jira tickets, parse design docs, or correlate changes into initiatives. Change-native GRC fills that gap.
What does change-native governance get right?
Change-native GRC embeds governance directly into the software development lifecycle. Instead of reviewing a sample of changes after the fact, platforms like Gist Security assess every change request for security, privacy, and compliance risk as it happens. Every decision, exception, and proof of control gets captured into an audit-ready evidence ledger automatically, eliminating the retroactive scramble that traditional audit cycles require.
How do you make the unverified surface visible?
Request your auditor’s sample selection memo and compare the number of items tested against the total control population. The delta between “tested” and “total” represents unverified residual risk. For engineering-heavy organizations shipping hundreds of changes daily, the unverified surface can exceed 99% of the total population. Presenting that number to leadership makes the sampling gap concrete and actionable.