
Most cybersecurity conversations about exposure focus on breaches, ransomware, or stolen credentials. Those are real threats, but many organizations overlook another category of risk: files that were never hacked at all, but which were simply made public or left public by mistake.
That was the focus of my AtlSecCon 2026 presentation: how attackers, investigators, and defenders discover public files that probably should not be public, and how organizations can reduce that risk.
For generalist cybersecurity practitioners, OSINT professionals, municipalities, nonprofits, and small-town businesses, this topic matters because exposure often happens quietly and without any alarms.
What Counts as a File That “Probably Shouldn’t Be Public”?
Good question, thanks for asking!
This can include all sorts of things, such as internal reports, draft documents, confidential PDFs, backup archives, exported databases, logs, development files, onboarding documents, invoices, HR records, misplaced uploads, and even source code. It also includes cloud storage contents such as public buckets or shared containers that were intended for internal use only.
Sometimes the issue is obvious, like a database backup in a public folder. Other times it is subtle, such as an archived staging site, a forgotten document indexed by search engines, or an old repository commit containing secrets.
Why This Matters
Publicly exposed files often reveal names, email addresses, vendors, software versions, directory structures, internal processes, and security tooling. A single document can create multiple new reconnaissance paths and expand your attack surface.
They also make you look like you don’t know what you’re doing, or—putting on my special hat which allows me to talk to enterprise clients—they create reputational risk. Customers and partners may (reasonably) wonder what else is unmanaged.
Finally, unmanaged public assets often signal weak governance. Attackers frequently interpret visible disorder as an invitation to look deeper, like a lion zeroing in on an injured gazelle.
Security Through Obscurity Does Not Work
A common misconception is that if something is not linked from the homepage, it is hidden.
In reality, unlinked content can still be found through search engines, sitemap files, robots.txt entries, JavaScript references, archives, certificate logs, passive DNS, and infrastructure indexing tools.
If a file is reachable and poorly controlled, it may effectively be public.
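To make the robots.txt point concrete, here is a minimal Python sketch that lists the paths a robots.txt file asks crawlers to avoid. Those entries are exactly the kind of hint that tells an outsider where to look. The function name and the sample content are made up for illustration, and the parsing is deliberately simple (it ignores which user agent each rule applies to).

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Return the paths named in Disallow rules, in order, without duplicates."""
    paths: list[str] = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow":
            path = value.strip()
            if path and path not in paths:
                paths.append(path)
    return paths


# Hypothetical robots.txt content, not taken from any real site.
sample = """\
User-agent: *
Disallow: /staging/
Disallow: /backups/db/   # nightly dumps
Allow: /public/
"""

print(disallowed_paths(sample))
```

Running the same check against your own robots.txt is a quick way to see what you are advertising to anyone who asks.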
Your Online Presence Is Usually Bigger Than You Think
Many organizations think in terms of “our website.” In practice, they often have a much broader digital footprint that extends well beyond a single domain. This can include multiple domains, numerous subdomains, cloud storage assets, SaaS platforms, staging environments, legacy vendor sites, public repositories, campaign landing pages, archived web content, and shared resources. Each of these assets can represent part of the organization’s online presence, and each may introduce its own visibility, security, and governance considerations.
That means the attack surface is rarely one clean perimeter. It is an ecosystem built over time.
How Public Files Get Found
Search Engines
Google remains one of the most effective reconnaissance tools available. Advanced search operators can reveal indexed files, PDFs, forgotten pages, and subdomains.
Useful concepts include:
- site:example.com
- filetype:pdf
- -inurl:www
- quoted phrases
- excluded keywords
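As a rough sketch of how these operators combine, the Python helper below assembles a query string from the pieces listed above. The function name and parameters are hypothetical, and it only builds text to paste into a search box; it does not call any search engine.

```python
def build_dork(domain, filetype=None, phrases=(), exclude_words=(), exclude_inurl=()):
    """Assemble a search query from common operators (illustrative helper)."""
    parts = [f"site:{domain}"]
    if filetype:
        parts.append(f"filetype:{filetype}")
    # Quoted phrases ask for an exact match.
    parts += [f'"{phrase}"' for phrase in phrases]
    # A leading minus excludes a keyword or an operator match.
    parts += [f"-{word}" for word in exclude_words]
    parts += [f"-inurl:{fragment}" for fragment in exclude_inurl]
    return " ".join(parts)


query = build_dork(
    "example.com",
    filetype="pdf",
    phrases=["internal use only"],
    exclude_inurl=["www"],
)
print(query)  # site:example.com filetype:pdf "internal use only" -inurl:www
```

Excluding `www` is a handy way to push results toward less-visited subdomains.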
Google also publishes documentation on indexable file types. And as I love to say, RTFM! (Read The Fantastic Manual) 😉
Additional dorking resources mentioned in the presentation:
- DorkSearch: https://dorksearch.com/
- No Nonsense Intel Adverse Media Search Tool: https://www.no-nonsense-intel.com/adverse-media-search-tool
- OneDorkForAll: https://github.com/HackShiv/OneDorkForAll/tree/main/dorks
- Gh0st D0rk Killer: https://github.com/theGh0stfaceKiller/Gh0st_D0rk_Killer
- Deep Dork Web: https://guilherme-moraiss.github.io/Deep-Dork-Web/
- DorkTerm: https://yogsec.github.io/DorkTerm/
- OSINT-CSE: https://github.com/paulpogoda/OSINT-CSE
- One-Liner-OSINT: https://github.com/yogsec/One-Liner-OSINT
- Dorks Collections List: https://github.com/cipher387/Dorks-collections-list
Alternative Search Engines
Google is powerful, but different engines surface different results. Depending on language, geography, or platform type, alternate engines may provide better visibility. For example, DuckDuckGo is much better than Google at surfacing things like Telegram channels or illicit forums, which can be useful for cyber threat intelligence. Search engines such as Baidu and Yandex provide better coverage of regional results for China and Russia, respectively.
Useful resources:
- eTools Meta Search: https://www.etools.ch/
- AllTheInternet: https://www.alltheinternet.com/
- IntelTechniques Search Tools: https://inteltechniques.com/tools/Search.html
- Search Engines With Own Indexes: https://seirdy.one/posts/2021/03/10/search-engines-with-own-indexes/
- Search Engine Colossus: https://www.searchenginecolossus.com/
- Search Start Page 1: https://start.me/p/ekl8eK/search-engines
- Search Start Page 2: https://start.me/p/b56G5Q/search-engines
- SearchTweaks: https://searchtweaks.com/
File Metadata
Finding public files is an iterative process; it doesn’t end when you find something, as that something might very well be your next lead. Always look through the file content to see where else it might lead you… but don’t sleep on the file metadata! Documents often contain hidden metadata such as:
- Author names
- Usernames
- Software versions
- Timestamps
- Original filenames
- Internal file paths
These details can create new pivots and help map internal environments.
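ExifTool handles far more formats, but as a small illustration of what "hidden metadata" means, the sketch below reads the core properties stored inside a .docx file using only the Python standard library. A .docx is a ZIP archive, and author details normally live in `docProps/core.xml`. The function name and the sample values are invented for this example, and a document without that part would raise a `KeyError`.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core-properties part of Office Open XML files.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def docx_core_properties(source) -> dict[str, str]:
    """Read selected metadata fields from a .docx path or file-like object."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    found = {}
    for label, tag in FIELDS.items():
        element = root.find(tag, NS)
        if element is not None and element.text:
            found[label] = element.text
    return found


# Build a tiny stand-in document in memory so the sketch runs without a real file.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr(
        "docProps/core.xml",
        f'<cp:coreProperties xmlns:cp="{NS["cp"]}" xmlns:dc="{NS["dc"]}">'
        "<dc:creator>jsmith</dc:creator>"
        "<cp:lastModifiedBy>a.smith</cp:lastModifiedBy>"
        "</cp:coreProperties>",
    )
demo.seek(0)

print(docx_core_properties(demo))
```

Usernames recovered this way often follow the organization's naming convention, which is a pivot in its own right.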
Useful tools:
- ExifTool: https://exiftool.org/
- Metagoofil: commonly available through OSINT tooling repositories and package managers
Infrastructure Search Engines
Search engines such as Google focus on regular web pages served over HTTP(S). Thinking about the internet more broadly, you can use specialized search engines to find things such as…
- Open ports and services
- TLS certificates
- Domain relationships
- Technology stacks
- Hosting patterns
- Reused analytics IDs
- Historical internet-facing assets
Examples mentioned in the talk:
- Shodan: https://www.shodan.io/
- Censys: https://search.censys.io/
- PublicWWW: https://publicwww.com/
- crt.sh: https://crt.sh/
- WhoXY: https://www.whoxy.com/
- ZoomEye: https://www.zoomeye.org/
- FOFA: https://fofa.info/
- LeakIX: https://leakix.net/
- Netlas: https://netlas.io/
- Onyphe: https://search.onyphe.io/
- GreyNoise: https://www.greynoise.io/
- BuiltWith: https://builtwith.com/
- Wappalyzer: https://www.wappalyzer.com/
- DNSDumpster: https://dnsdumpster.com/
- SecurityTrails: https://securitytrails.com/
- ViewDNS: https://viewdns.info/
- URLScan: https://urlscan.io/
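As one example of turning these sources into leads, the sketch below pulls hostnames out of a certificate transparency export. It assumes you have saved crt.sh JSON output for your domain and that each record carries a `name_value` field with one or more newline-separated names; treat that field name as an assumption to verify against a real export. The function name and sample data are hypothetical.

```python
import json


def hostnames_from_ct_export(json_text: str, domain: str) -> list[str]:
    """Collect unique hostnames under `domain` from certificate log records."""
    names = set()
    for entry in json.loads(json_text):
        # Assumed field: one certificate can list several names, one per line.
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # treat wildcard entries as their parent name
            if name == domain or name.endswith("." + domain):
                names.add(name)
    return sorted(names)


# Made-up records shaped like the assumed export format.
sample = json.dumps([
    {"name_value": "www.example.com\nstaging.example.com"},
    {"name_value": "*.dev.example.com"},
    {"name_value": "www.example.com"},
])

print(hostnames_from_ct_export(sample, "example.com"))
```

Each hostname that comes back is a candidate for the asset inventory, and a forgotten staging or dev name is a common find.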
Open Directories and File Servers
Misconfigured FTP servers, open web directories, and exposed file shares can also provide immediate access to sensitive content. Resources from the presentation:
FTP Indexers
- SearchFTPS: https://www.searchftps.net/
- Mamont FTP Index: https://www.mmnt.net/
- Freeware FTP Search: http://www.freewareweb.com/ftpsearch.shtml
Open Directory Discovery
- Open Directory Finder: https://ewasion.github.io/opendirectory-finder/
- ODCrawler: https://odcrawler.xyz/
- EyeDex: https://www.eyedex.org/
Cloud Storage Buckets
Public S3 buckets, Azure Blob containers, and similar storage services continue to expose:
- Backups
- Archives
- Logs
- Internal datasets
- Static production assets
Common causes include public-read permissions, weak access control, predictable names, and poor governance.
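On the defensive side, the public-read case can be illustrated with a small check over an S3-style bucket policy document. This is a simplified sketch under stated assumptions: it only looks for `Allow` statements with a wildcard principal and no `Condition`, and it ignores ACLs, account-level public access settings, and `NotPrincipal`. The function name, bucket name, and account ID are placeholders.

```python
import json


def public_statements(policy_json: str) -> list[dict]:
    """Return Allow statements that grant access to everyone, unconditionally."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    flagged = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal")
        aws = principal.get("AWS") if isinstance(principal, dict) else principal
        wildcard = aws == "*" or (isinstance(aws, list) and "*" in aws)
        if wildcard and "Condition" not in statement:
            flagged.append(statement)
    return flagged


# Hypothetical policy: one public statement, one restricted to a role.
sample_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "PublicRead", "Effect": "Allow", "Principal": "*",
         "Action": "s3:GetObject", "Resource": "arn:aws:s3:::example-bucket/*"},
        {"Sid": "TeamOnly", "Effect": "Allow",
         "Principal": {"AWS": "arn:aws:iam::123456789012:role/team"},
         "Action": "s3:*", "Resource": "arn:aws:s3:::example-bucket/*"},
    ],
})

print([s["Sid"] for s in public_statements(sample_policy)])
```

A flagged statement is not automatically a problem, since static production assets are often meant to be public, but it should be a deliberate choice.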
Bucket discovery resources:
- Open Buckets: https://openbuckets.io/
- Grayhat Warfare: https://buckets.grayhatwarfare.com/
- OSINT.Sh Buckets: https://osint.sh/buckets/
- SOCRadar BlueBleed: https://socradar.io/labs/bluebleed/
Other tooling mentioned:
- AWSBucketDump
- S3Scanner
- CloudBrute
Public Code Repositories
GitHub, GitLab, and Bitbucket are valuable collaboration platforms, but they also preserve history.
Even if sensitive content is removed today, it may still exist in old commits. Organizations should review not only what is public now, but what was public before.
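To show why history matters, here is a minimal sketch that scans every patch in a repository's history for a couple of well-known secret formats. The two patterns are illustrative only, and dedicated secret scanners cover far more cases. The function names are hypothetical, and `history_patches` assumes the `git` command is installed and that you point it at a local clone.

```python
import re
import subprocess

# Illustrative patterns only; real scanners ship many more.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def find_secrets(patch_text: str) -> list[tuple[str, str]]:
    """Return (pattern name, line) pairs for added lines that match a pattern."""
    hits = []
    for line in patch_text.splitlines():
        # Added lines start with "+"; "+++" marks a file header, not content.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line[1:].strip()))
    return hits


def history_patches(repo_path: str) -> str:
    """Return the full patch history of a local clone, across all refs."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--all", "-p"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout


# Sample patch text using the example key from AWS documentation.
sample_patch = (
    "+++ b/config.py\n"
    '+aws_key = "AKIAIOSFODNN7EXAMPLE"\n'
    "-old_setting = 1\n"
)
print(find_secrets(sample_patch))
```

Against a real clone, `find_secrets(history_patches("path/to/clone"))` would include content that was deleted in later commits, which is the point: removal from the current tree is not removal from history.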
Web Archives
Archive platforms preserve previous versions of websites, removed documents, old scripts, and historical infrastructure clues. These services are useful for defenders because they show what may still be visible after cleanup. I actually wrote another post about Internet archive sites here.
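For a sense of how a defender might check what an archive still holds, the sketch below builds a query URL for the Wayback Machine's CDX index and converts its JSON rows into records. The parameters (`url`, `output`, `fl`, `collapse`, `limit`) reflect my understanding of that API and should be checked against its current documentation. The function names are hypothetical, and no request is sent here.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain: str, limit: int = 500) -> str:
    """Build a CDX query listing archived URLs under a domain (no request is made)."""
    params = {
        "url": f"{domain}/*",          # everything under the domain
        "output": "json",
        "fl": "original,timestamp,mimetype",
        "collapse": "urlkey",          # one row per distinct URL
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def rows_to_records(rows: list[list[str]]) -> list[dict[str, str]]:
    """Convert JSON rows (assumed: first row is the header) into dictionaries."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


print(cdx_query_url("example.com", limit=50))

# Made-up rows in the assumed response shape.
sample_rows = [
    ["original", "timestamp", "mimetype"],
    ["http://example.com/reports/draft.pdf", "20190101000000", "application/pdf"],
]
print(rows_to_records(sample_rows))
```

Filtering the records by MIME type (PDFs, archives, spreadsheets) is a quick way to spot documents that were removed from the live site but remain retrievable.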
Paste Sites and Code Search
Temporary convenience often becomes long-term exposure. This may include:
- Troubleshooting snippets
- Logs pasted externally
- Config files shared for support
- Credentials
- API keys
- Internal URLs
- Sensitive copied data
Resources mentioned:
- grep.app: https://grep.app/
- RedHunt Labs Online IDE Search (as shown in notes)
URL Shorteners
Shortened links from services like bit.ly, t.co, and tinyurl can expose destination paths, collaboration links, marketing structure, forms, and shared resources that rely on obscurity.
If shortened URLs are indexed or searchable, they can become discovery sources. The best tool I’ve seen for this is Grayhat Warfare’s URL shortener search.
What Defenders Should Do
Build an Accurate Asset Inventory
Yes! 100% … You should know every domain, subdomain, cloud bucket, repository, vendor-managed property, and public-facing service tied to your organization, and have assigned ownership/responsibility for it. Here’s how to get your asset inventory started.
Conduct Regular Exposure Reviews
Search for your own company the way an outsider would. Review indexed files, archives, buckets, repositories, and certificates.
Remove Unintentional Assets
Every public file, page, folder, and service should exist for a reason. If it has no purpose, retire it. This does not apply to honeypots; those do serve a purpose… but having a bunch of random assets will not slow down hackers or allow you to observe their TTPs!
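The ownership and purpose questions above can be sketched as a tiny inventory check. Everything here is hypothetical (the record fields, the function name, the sample entries); the idea is simply that any asset without a named owner or a stated purpose gets surfaced for review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Asset:
    identifier: str                 # e.g. a domain, bucket name, or repository URL
    kind: str                       # e.g. "domain", "bucket", "repository"
    owner: Optional[str] = None     # team or person responsible
    purpose: Optional[str] = None   # why this asset is public


def needs_review(assets: list[Asset]) -> list[Asset]:
    """Return assets that lack an owner or a stated purpose."""
    return [asset for asset in assets if not asset.owner or not asset.purpose]


# Made-up inventory entries for illustration.
inventory = [
    Asset("www.example.com", "domain", owner="web-team", purpose="main site"),
    Asset("old-campaign.example.com", "subdomain", owner="marketing"),
    Asset("example-backups", "bucket"),
]

for asset in needs_review(inventory):
    print(f"{asset.kind}: {asset.identifier} (owner={asset.owner}, purpose={asset.purpose})")
```

A spreadsheet does the same job; what matters is that the "no owner" and "no purpose" rows are easy to pull out and act on.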
Govern Third Parties
Many exposures originate through contractors, agencies, or legacy vendors. Ownership must be explicit, and terms should be clear in your service level agreements.
Think in Terms of Discoverability and Control
If something is discoverable, ask what control exists around it. If the answer is unclear, investigate.
Final Thoughts
Many organizations prepare for dramatic cyberattacks while overlooking quieter risks already sitting in public view. Assume anything online can be found, then build your security program accordingly.
