Finding Public Files That Probably Shouldn’t Be Public

Most cybersecurity conversations about exposure focus on breaches, ransomware, or stolen credentials. Those are real threats, but many organizations overlook another category of risk: files that were never hacked at all, but were simply made public or left public by mistake.

That was the focus of my AtlSecCon 2026 presentation: how attackers, investigators, and defenders discover public files that probably should not be public, and how organizations can reduce that risk.

For generalist cybersecurity practitioners, OSINT professionals, municipalities, nonprofits, and small-town businesses, this topic matters because exposure often happens quietly and without any alarms.

What Counts as a File That “Probably Shouldn’t Be Public”?

Good question, thanks for asking!

This can include all sorts of things, such as internal reports, draft documents, confidential PDFs, backup archives, exported databases, logs, development files, onboarding documents, invoices, HR records, misplaced uploads, and even source code. It also includes cloud storage contents such as public buckets or shared containers that were intended for internal use only.

Sometimes the issue is obvious, like a database backup in a public folder. Other times it is subtle, such as an archived staging site, a forgotten document indexed by search engines, or an old repository commit containing secrets.

Why This Matters

Publicly exposed files often reveal names, email addresses, vendors, software versions, directory structures, internal processes, and security tooling. A single document can create multiple new reconnaissance paths and expand your attack surface.

They also make you look like you don’t know what you’re doing, or—putting on my special hat which allows me to talk to enterprise clients—they create reputational risk. Customers and partners may (reasonably) wonder what else is unmanaged.

Finally, unmanaged public assets often signal weak governance. Attackers frequently interpret visible disorder as an invitation to look deeper, like a lion zeroing in on an injured gazelle.

Security Through Obscurity Does Not Work

A common misconception is that if something is not linked from the homepage, it is hidden.

In reality, unlinked content can still be found through search engines, sitemap files, robots.txt entries, JavaScript references, archives, certificate logs, passive DNS, and infrastructure indexing tools.

If a file is reachable and poorly controlled, it may effectively be public.
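
As a small illustration of why "unlinked" is not "hidden", here is a minimal sketch that pulls Disallow paths and Sitemap URLs out of robots.txt content. The function name and the sample file are made up for this example; it only parses text you hand it and does not fetch anything.

```python
def robots_hints(robots_txt: str) -> dict[str, list[str]]:
    """Collect Disallow paths and Sitemap URLs from robots.txt content."""
    hints: dict[str, list[str]] = {"disallow": [], "sitemap": []}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field in hints and value:
            hints[field].append(value)
    return hints


# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /staging/
Disallow: /backups/  # old exports
Sitemap: https://example.com/sitemap.xml
"""

if __name__ == "__main__":
    print(robots_hints(SAMPLE))
```

Every path listed under Disallow is a path the site owner has told the whole internet about, which is worth remembering before using robots.txt to "hide" anything.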

Your Online Presence Is Usually Bigger Than You Think

Many organizations think in terms of “our website.” In practice, they often have a much broader digital footprint that extends well beyond a single domain. This can include multiple domains, numerous subdomains, cloud storage assets, SaaS platforms, staging environments, legacy vendor sites, public repositories, campaign landing pages, archived web content, and shared resources. Each of these assets can represent part of the organization’s online presence, and each may introduce its own visibility, security, and governance considerations.

That means the attack surface is rarely one clean perimeter. It is an ecosystem built over time.

How Public Files Get Found

Search Engines

Google remains one of the most effective reconnaissance tools available. Advanced search operators can reveal indexed files, PDFs, forgotten pages, and subdomains.

Useful concepts include:

site:example.com
filetype:pdf
-inurl:www
quoted phrases
excluded keywords
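
The concepts above combine into a single query string. Here is a minimal sketch of a helper that assembles one; the function and parameter names are my own invention, and the output is simply text you would paste into a search box.

```python
def build_dork(domain, filetype=None, phrase=None,
               exclude_www=False, exclude_terms=()):
    """Assemble a search query from the operators listed above."""
    parts = [f"site:{domain}"]
    if filetype:
        parts.append(f"filetype:{filetype}")
    if exclude_www:
        parts.append("-inurl:www")       # nudge results toward other subdomains
    if phrase:
        parts.append(f'"{phrase}"')      # quoted phrase = exact match
    parts.extend(f"-{term}" for term in exclude_terms)
    return " ".join(parts)


if __name__ == "__main__":
    print(build_dork("example.com", filetype="pdf",
                     phrase="internal use only", exclude_www=True,
                     exclude_terms=["brochure"]))
```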

Google also publishes documentation on indexable file types. And as I love to say, RTFM! (Read The Fantastic Manual) 😉

Additional dorking resources mentioned in the presentation:

DorkSearch: https://dorksearch.com/
No Nonsense Intel Adverse Media Search Tool: https://www.no-nonsense-intel.com/adverse-media-search-tool
OneDorkForAll: https://github.com/HackShiv/OneDorkForAll/tree/main/dorks
Gh0st D0rk Killer: https://github.com/theGh0stfaceKiller/Gh0st_D0rk_Killer
Deep Dork Web: https://guilherme-moraiss.github.io/Deep-Dork-Web/
DorkTerm: https://yogsec.github.io/DorkTerm/
OSINT-CSE: https://github.com/paulpogoda/OSINT-CSE
One-Liner-OSINT: https://github.com/yogsec/One-Liner-OSINT
Dorks Collections List: https://github.com/cipher387/Dorks-collections-list

Alternative Search Engines

Google is powerful, but different engines surface different results. Depending on language, geography, or platform type, alternate engines may provide better visibility. For example, DuckDuckGo is much better than Google at serving things like Telegram channels or illicit forums, which can be useful for cyber threat intelligence. Search engines such as Baidu and Yandex provide better coverage of regional results for China and Russia respectively.

Useful resources:

eTools Meta Search: https://www.etools.ch/
AllTheInternet: https://www.alltheinternet.com/
IntelTechniques Search Tools: https://inteltechniques.com/tools/Search.html
Search Engines With Own Indexes: https://seirdy.one/posts/2021/03/10/search-engines-with-own-indexes/
Search Engine Colossus: https://www.searchenginecolossus.com/
Search Start Page 1: https://start.me/p/ekl8eK/search-engines
Search Start Page 2: https://start.me/p/b56G5Q/search-engines
SearchTweaks: https://searchtweaks.com/

File Metadata

Finding public files is an iterative process; it doesn’t end when you find something, as that something might very well be your next lead. Always look through the file content to see where else it might lead you… but don’t sleep on the file metadata! Documents often contain hidden metadata such as:

Author names
Usernames
Software versions
Timestamps
Original filenames
Internal file paths

These details can create new pivots and help map internal environments.

Useful tools:

ExifTool: https://exiftool.org/
Metagoofil: commonly available through OSINT tooling repositories and package managers
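
To make the metadata point concrete: a .docx file is a ZIP archive, and its author fields live in docProps/core.xml. The sketch below reads a few of them using only the Python standard library. It is a simplified stand-in for what ExifTool does far more thoroughly; the function name and field labels are my own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the Office Open XML core-properties part.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}
FIELDS = {
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def docx_core_properties(data: bytes) -> dict[str, str]:
    """Return the core metadata fields present in a .docx file's bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    found = {}
    for label, path in FIELDS.items():
        node = root.find(path, NS)
        if node is not None and node.text:
            found[label] = node.text
    return found
```

Usage would look like docx_core_properties(open("report.docx", "rb").read()), where report.docx is whatever file you are reviewing. Usernames that show up here often follow the organization's account naming convention, which is exactly the kind of pivot described above.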

Infrastructure Search Engines

Search engines such as Google focus on regular web pages, i.e. content served over HTTP(S). Thinking about the internet more broadly, you can use specialized search engines to find things such as:

Open ports and services
TLS certificates
Domain relationships
Technology stacks
Hosting patterns
Reused analytics IDs
Historical internet-facing assets

Examples mentioned in the talk:

Shodan: https://www.shodan.io/
Censys: https://search.censys.io/
PublicWWW: https://publicwww.com/
crt.sh: https://crt.sh/
WhoXY: https://www.whoxy.com/
ZoomEye: https://www.zoomeye.org/
FOFA: https://fofa.info/
LeakIX: https://leakix.net/
Netlas: https://netlas.io/
Onyphe: https://search.onyphe.io/
GreyNoise: https://www.greynoise.io/
BuiltWith: https://builtwith.com/
Wappalyzer: https://www.wappalyzer.com/
DNSDumpster: https://dnsdumpster.com/
SecurityTrails: https://securitytrails.com/
ViewDNS: https://viewdns.info/
URLScan: https://urlscan.io/
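
As one example of pivoting on TLS certificates, crt.sh can return its results as JSON. The sketch below turns that output into a deduplicated hostname list. It assumes each entry has a name_value field containing newline-separated names, which matches the output I have seen but is not a documented guarantee; the function name is hypothetical, and the code parses text you supply rather than querying the service.

```python
import json


def hostnames_from_crtsh(json_text: str, domain: str) -> list[str]:
    """Extract unique hostnames under `domain` from crt.sh-style JSON."""
    names = set()
    for entry in json.loads(json_text):
        # Assumption: "name_value" holds one or more names, one per line.
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower().removeprefix("*.")
            if name == domain or name.endswith("." + domain):
                names.add(name)
    return sorted(names)
```

Hostnames that appear in certificate logs but not in your asset inventory are good candidates for a closer look.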

Open Directories and File Servers

Misconfigured FTP servers, open web directories, and exposed file shares can also provide immediate access to sensitive content. Resources from the presentation:

FTP Indexers

SearchFTPS: https://www.searchftps.net/
Mamont FTP Index: https://www.mmnt.net/
Freeware FTP Search: http://www.freewareweb.com/ftpsearch.shtml

Open Directory Discovery

Open Directory Finder: https://ewasion.github.io/opendirectory-finder/
ODCrawler: https://odcrawler.xyz/
EyeDex: https://www.eyedex.org/

Cloud Storage Buckets

Public S3 buckets, Azure Blob containers, and similar storage services continue to expose:

Backups
Archives
Logs
Internal datasets
Static production assets

Common causes include public-read permissions, weak access control, predictable names, and poor governance.
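
On the defensive side, a quick first check for the "public-read" cause is to look for wildcard principals in a bucket policy. The sketch below does that for an S3-style policy document passed in as JSON text. It is deliberately naive: it ignores ACLs, account-level public access settings, and the actual meaning of any Condition block, so treat a hit as a prompt to investigate rather than a verdict. The function names are my own.

```python
import json


def _as_list(value):
    """Normalize a policy value that may be missing, a string, or a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def public_statements(policy_json: str) -> list[dict]:
    """Return Allow statements that grant access to everyone ("*")."""
    statements = json.loads(policy_json).get("Statement", [])
    if isinstance(statements, dict):      # a single statement is also valid
        statements = [statements]
    flagged = []
    for stmt in statements:
        principal = stmt.get("Principal")
        wildcard = principal == "*" or (
            isinstance(principal, dict) and "*" in _as_list(principal.get("AWS"))
        )
        # Statements with a Condition are skipped here, not evaluated.
        if stmt.get("Effect") == "Allow" and wildcard and "Condition" not in stmt:
            flagged.append(stmt)
    return flagged
```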

Bucket discovery resources:

Open Buckets: https://openbuckets.io/
Grayhat Warfare: https://buckets.grayhatwarfare.com/
OSINT.Sh Buckets: https://osint.sh/buckets/
SOCRadar BlueBleed: https://socradar.io/labs/bluebleed/

Other tooling mentioned:

AWSBucketDump
S3Scanner
CloudBrute

Public Code Repositories

GitHub, GitLab, and Bitbucket are valuable collaboration platforms, but they also preserve history.

Even if sensitive content is removed today, it may still exist in old commits. Organizations should review not only what is public now, but what was public before.
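
To review what was public before, you can scan the full patch history rather than just the current files. Here is a minimal sketch that takes the text output of `git log -p --all` and flags added or removed lines matching two well-known patterns (AWS access key IDs and PEM private key headers). The rule set is a tiny illustrative sample chosen by me, and the function name is hypothetical; dedicated secret scanners cover far more.

```python
import re

# Illustrative rules only; real scanners ship hundreds of these.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def scan_patch_text(patch_text: str) -> list[tuple[str, str, str]]:
    """Return (short commit hash, rule name, line) for each suspicious diff line."""
    hits = []
    commit = "unknown"
    for line in patch_text.splitlines():
        if line.startswith("commit "):
            commit = line.split()[1][:12]
        elif line.startswith(("+", "-")) and not line.startswith(("+++ ", "--- ")):
            # Removed lines matter too: the secret is still in history.
            for rule, pattern in PATTERNS.items():
                if pattern.search(line):
                    hits.append((commit, rule, line))
    return hits
```

One way to feed it is subprocess.run(["git", "log", "-p", "--all"], capture_output=True, text=True).stdout from inside the repository. Any hit means the credential should be rotated, not just deleted from the latest commit.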

Web Archives

Archive platforms preserve previous versions of websites, removed documents, old scripts, and historical infrastructure clues. These services are useful for defenders because they show what may still be visible after cleanup. I actually wrote another post about Internet archive sites here.

Paste Sites and Code Search

Temporary convenience often becomes long-term exposure. This may include:

Troubleshooting snippets
Logs pasted externally
Config files shared for support
Credentials
API keys
Internal URLs
Sensitive copied data

Resources mentioned:

grep.app: https://grep.app/
RedHunt Labs Online IDE Search (as shown in notes)

URL Shorteners

Shortened links (bit.ly, t.co, tinyurl, etc.) can expose destination paths, collaboration links, marketing structure, forms, and shared resources that rely on obscurity.

If shortened URLs are indexed or searchable, they can become discovery sources. The best tool I’ve seen for this is Grayhat Warfare’s URL shortener search.

What Defenders Should Do

Build an Accurate Asset Inventory

Yes! 100% … You should know every domain, subdomain, cloud bucket, repository, vendor-managed property, and public-facing service tied to your organization, and have assigned ownership/responsibility for it. Here’s how to get your asset inventory started.
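
If you have nothing yet, even a very small structure helps. The sketch below is one hypothetical way to model an inventory and flag entries with no owner or an overdue review. The field names and the 90-day threshold are assumptions for illustration, not a standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Asset:
    identifier: str                 # e.g. a domain, bucket name, or repo URL
    kind: str                       # e.g. "subdomain", "bucket", "repository"
    owner: Optional[str] = None     # person or team accountable for it
    last_reviewed: Optional[date] = None


def needs_attention(assets, today: date, max_age_days: int = 90):
    """Return (identifier, reason) for unowned or stale inventory entries."""
    flagged = []
    for asset in assets:
        if not asset.owner:
            flagged.append((asset.identifier, "no owner"))
        elif (asset.last_reviewed is None
              or (today - asset.last_reviewed).days > max_age_days):
            flagged.append((asset.identifier, "review overdue"))
    return flagged
```

A spreadsheet with the same four columns works just as well; what matters is that every asset has a named owner and a review date.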

Conduct Regular Exposure Reviews

Search for your own company the way an outsider would. Review indexed files, archives, buckets, repositories, and certificates.

Remove Unintentional Assets

Every public file, page, folder, and service should exist for a reason. If it has no purpose, retire it. This does not apply to honeypots; those do serve a purpose… but having a bunch of random assets will not slow down hackers or allow you to observe their TTPs!

Govern Third Parties

Many exposures originate through contractors, agencies, or legacy vendors. Ownership must be explicit, and terms should be clear in your service level agreements.

Think in Terms of Discoverability and Control

If something is discoverable, ask what control exists around it. If the answer is unclear, investigate.

Final Thoughts

Many organizations prepare for dramatic cyberattacks while overlooking quieter risks already sitting in public view. Assume anything online can be found, then build your security program accordingly.