How to create a simple robots.txt file for your website
A robots.txt file tells search engines what they are and aren't allowed to crawl and index on your website. By using the robots.txt file properly, we can control the way our websites are crawled and indexed.
All major search engines check whether a robots.txt file is present and adhere to any special rules you may have set for them; spam bots and the like may ignore this file.
Benefits of using a robots.txt file
- Can prevent duplicate content from being indexed
- Saves bandwidth and keeps logs clean
- Stops your server from repeatedly returning 404 Not Found errors when search engines request a robots.txt file on each visit
- Prevents system folders and irrelevant files from showing up in Google and other search engines
- Can prevent confidential documents or images from being indexed
- Helps crawlers focus their attention on the pages you do want indexed
- Can make it harder for users to find restricted pages of your website via search engines
How to create and upload a robots.txt file
- Create a blank document called robots in Notepad (or any plain-text editor) and save it as plain text (*.txt); most word processors can also save as plain text
- Write or copy in the code you want, and save.
- Upload the robots.txt file to the root directory of your website via FTP (usually this is the public_html or www folder), and you're done
Allowing Robots Access to Everything
User-agent: *
Disallow:
This instructs search engines and robots that they can crawl and index all content.
Denying Robots Access to Everything
User-agent: *
Disallow: /
This instructs search engines and robots that they may not crawl or index any content.
Basic Template for a simple robots file
User-agent: *
Disallow: /cgi-bin/
Disallow: /*.js$
Disallow: /*.css$

# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*

# Google AdSense
User-agent: Mediapartners-Google*
Disallow:
Allow: /*
How Robot file rules work
User-agent is basically a synonym for "search engines and robots called:", and * means anything, so in this case we are targeting all user-agents.
Disallow: /cgi-bin/
Disallow: /*.js$
Disallow: /*.css$
The Disallow parameter is used to indicate locations that may not be accessed.
A Disallow value with a slash on both ends, such as /cgi-bin/, blocks that folder and everything beneath it: any URL whose path begins with /cgi-bin/ may not be crawled.
Disallow: /*.css$ means that any URL ending in .css may not be crawled. This works for any file type or extension; for example, you could prevent confidential PDFs (.pdf) from being accessed by Google, or images (.jpg, .gif), and so on. Note that the * and $ wildcards are pattern-matching extensions supported by Google and Bing, but not by every robot.
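For example, a sketch of blocking PDFs and common image formats from Google's crawler (Googlebot and the extensions here are real, but adapt the file types to your own site):

```
User-agent: Googlebot
Disallow: /*.pdf$
Disallow: /*.jpg$
Disallow: /*.gif$
```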
# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*
Using # at the start of a line is called commenting; it helps us humans make notes to remember what the hell we did, and tells robots to ignore that line.
Here we are targeting the Googlebot-Image user-agent specifically, letting it know that it may access and index all images with Allow: /*, and that nothing is specifically forbidden by leaving Disallow blank. Note that not all robots support Allow.
The reason we specifically target the image bot and AdSense bot after targeting all user-agents is to make sure they are not blocked by an earlier rule. For example, you could prevent everything in the /news/ section of your website from being indexed (text, files, images) but still want images to be indexed and AdSense to keep working.
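For instance, a sketch of that /news/ scenario (the folder name is just an example):

```
# Block the news section for everyone...
User-agent: *
Disallow: /news/

# ...but let Google Image still crawl the images
User-agent: Googlebot-Image
Disallow:
Allow: /*
```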
Referencing a sitemap in your simple robots file
If you have a sitemap, you can inform the search engines where it is with a Sitemap directive; it's generally a good idea to keep your sitemap in the root directory of your website. [Create a sitemap]
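The directive takes the full URL of your sitemap (example.com below is a placeholder for your own domain):

```
Sitemap: https://www.example.com/sitemap.xml
```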
Denying access to search result pages
Disallow: /*?*
Disallow: /*?
By preventing URLs that contain question marks from being crawled, we can stop search-result pages from being indexed and let the robots focus on your content and products. Be aware that this blocks every URL with a query string, so make sure none of your real content depends on them.
Content Management Systems And Robots
Many content management systems have plugins or add-ons that can help generate or maintain robots files; however, because each CMS is structured differently, it's not possible to make one robots file that caters for everything.
Common Mistakes & Things To Remember
- The robots file is case sensitive, and the file name is all lowercase: robots.txt
- Keep to one rule per line
- Rule precedence varies: some crawlers apply later rules over earlier ones, while Google uses the most specific (longest) matching rule, so order and group your rules carefully
- Anyone can view your robots.txt file, so be careful what you include in it
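You can sanity-check your rules locally before uploading with Python's standard-library urllib.robotparser. This is a sketch using a small hypothetical rule set parsed from a string; note that this parser does simple prefix matching only and does not understand the * and $ wildcard extensions:

```python
from urllib import robotparser

# Hypothetical rules -- prefix matching only
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /search
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/index.html"))   # True: not blocked
print(rp.can_fetch("*", "https://example.com/cgi-bin/run"))  # False: under /cgi-bin/
```

This is handy for catching typos like a missing leading slash before a bad rule goes live.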
Are there any other great robots.txt rules that you use on your websites?