CONTENT DISCOVERY

Content discovery, in the context of web application security, is the act of finding content such as files, pages and features that aren't immediately apparent or aren't intended to be publicly accessible. Content discovery can be achieved via manual discovery, open source intelligence (OSINT) or by using automated tools.

Manual Discovery is a method of finding content by exploring a web app by hand. Some examples of places to look are:

  • Robots.txt
  • Sitemap.xml
  • HTTP Headers
  • Favicon
Robots.txt is a file that web applications use to specify which parts of the website should be included in or excluded from search engine results. It also specifies which crawlers are allowed to access the site. This can be a useful file for a penetration tester to have access to, as it can provide a list of pages that were not intended to be viewed.

In the example below, the robots.txt file reveals a "/staff-portal" page which could be of interest.
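A minimal sketch of what that might look like, assuming a made-up target at http://10.10.10.10 and the standard /robots.txt location:

    # request the robots.txt file directly
    curl http://10.10.10.10/robots.txt

    # example response
    User-agent: *
    Allow: /
    Disallow: /staff-portal

The Disallow entry asks crawlers to stay away from /staff-portal, which makes it exactly the kind of page worth visiting manually.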



Sitemap.xml is a file that lists the pages a site considers essential, so that search engines can crawl them and show them in their search results. As a penetration tester, it can be used to find older or harder-to-reach parts of a web app.
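As a rough sketch (again using the made-up 10.10.10.10 target), the sitemap can be fetched and the listed URLs pulled out of its <loc> tags; the -P flag below assumes GNU grep:

    # fetch the sitemap quietly and extract each listed URL
    curl -s http://10.10.10.10/sitemap.xml | grep -oP '(?<=<loc>).*?(?=</loc>)'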

HTTP Headers can be viewed by using curl http://[IP] -v; the verbose flag displays the source code of the site along with the HTTP request and response headers, which can contain metadata such as the web server software and the back-end programming language.
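For illustration, the command might produce headers like the ones below (the Server and X-Powered-By values are made up, not from a real target):

    # -v prints the request and response headers as well as the page body
    curl http://10.10.10.10 -v

    # response headers of interest could include:
    # Server: Apache/2.4.41 (Ubuntu)
    # X-Powered-By: PHP/7.4.3
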
Favicon is the small icon next to the URL in a browser's address bar. If the developers of the site have not changed it, it may still be the default favicon of the site's framework. This can be checked using curl http://[address of favicon] | md5sum and looking up the resulting hash in the OWASP favicon database. With information about the framework the web app is built on, new avenues for exploitation may open up.
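A minimal sketch of that check, assuming the favicon sits at the common /favicon.ico path:

    # download the favicon quietly and print its MD5 hash
    curl -s http://10.10.10.10/favicon.ico | md5sum

The first field of the output is the hash to look up in the OWASP favicon database.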


Open Source Intelligence (OSINT) is the gathering of information that is publicly available. Some examples of avenues to explore when looking for content are:
  • Google hacking/dorking - Using Google search operators to find content that may otherwise be difficult to discover (see the example queries after this list).
  • Wappalyzer - An online tool for identifying the technologies used to build a site.
  • Wayback Machine - A site that archives historical versions of other websites.
  • GitHub - A hosting service for Git repositories which allows users to share and collaborate on software. It can be useful for finding source code or information that was not intended to be publicly available.
  • S3 Buckets - A cloud storage service provided by Amazon that may be left publicly accessible.
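A few illustrative dork queries, using example.com as a placeholder domain:

    site:example.com inurl:admin          (pages on the domain with "admin" in the URL)
    site:example.com filetype:pdf         (PDF files indexed on the domain)
    site:example.com intitle:"index of"   (exposed directory listings on the domain)
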
Automated Tools such as dirb, gobuster and ffuf are designed to quickly test a website for pages listed in a wordlist provided by the user. Any wordlist entries that exist on the web app are reported back, which makes these tools useful for quickly finding common pages that aren't necessarily intended for public use. Rough examples of gobuster and ffuf are shown below.
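These sketches assume the made-up 10.10.10.10 target and a wordlist path that ships with Kali (paths may differ on other systems):

    # gobuster: brute force paths on the target using the supplied wordlist
    gobuster dir -u http://10.10.10.10 -w /usr/share/wordlists/dirb/common.txt

    # ffuf: the FUZZ keyword marks where each wordlist entry is substituted
    ffuf -u http://10.10.10.10/FUZZ -w /usr/share/wordlists/dirb/common.txt

Both tools print each path that gets a response, along with its HTTP status code.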




