When a crawler visits a web site, such as http://www.yourshophere.com/, it first checks for http://www.yourshophere.com/robots.txt. If the document exists, the crawler analyzes its contents to see which pages on the site it is allowed to index. You can customize the robots.txt file to apply only to specific robots and to disallow access to specific directories or files, as the examples below show.
User-agent: * # applies to all robots
Disallow: /   # disallow indexing of all pages
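The example above blocks all pages for every robot. As a sketch of the customizations described earlier, the following hypothetical file targets one crawler and blocks only specific directories; the Googlebot record and the directory names are assumptions for illustration, not paths your site necessarily has.

User-agent: Googlebot # applies only to Google's crawler
Disallow: /checkout/  # hypothetical directory excluded from indexing
Disallow: /account/   # hypothetical directory excluded from indexing

User-agent: *         # all other robots
Disallow:             # empty value: all pages can be retrieved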
The robot looks for a /robots.txt URI on your site, where a site is defined as an HTTP server running on a particular host and port number. There can only be a single /robots.txt file on a site. If you want to create a robots.txt file for one or more sites individually, you can use Business Manager to create the file. This robots.txt file is served to any requesting crawlers from the application server. It's stored as a site preference and can be replicated from one instance to another.
If you want to create a single robots.txt file that can be used for multiple sites, you can use Google's Webmaster Tools to create this file. However, you must have a Google account to do so. If you choose not to use Google, you can use other third-party tools to create this file. After you create the file, you must upload it to your cartridge. You must also invalidate the Static Content Cache for a new or different robots.txt file to be generated or served.
Keep the following rules in mind when writing a robots.txt file (a short example follows below):
Include at least one User-agent field per record. The robot should be liberal in interpreting this field.
Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.
An empty value for Disallow indicates that all URIs can be retrieved.
At least one Disallow field must be present in the robots.txt file.
Allow: / isn't valid and will be ignored.
Before creating a robots.txt file, it is important to understand how the Storefront Password Protection settings for your site affect what can be crawled. If Storefront Password Protection is enabled, a robots.txt file is automatically generated and denies access to all static resources for a site. If Storefront Password Protection is disabled, the robots.txt file determines whether content is crawled. Because Storefront Password Protection automatically generates a robots.txt file, it must be disabled before you can specify another type of robots.txt file.
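The following sketch illustrates the Disallow rules above; the /search/ path is hypothetical and is included only to show the trailing-slash behavior.

User-agent: *      # at least one User-agent field per record
Disallow: /help    # blocks both /help.html and /help/index.html
Disallow: /search/ # trailing slash: blocks /search/index.html but not /search.html (hypothetical path)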
Use the robots.txt file from a deployed cartridge: Use Google Webmaster Tools or another third party to generate your robots.txt file. Add the file to a cartridge on your site path. There can only be one robots.txt file per site. Choose this approach if you want to generate a robots.txt file using another tool and upload it to your cartridge. This option is most useful if you want to use the same robots.txt file for multiple sites.
This is not recommended, because you usually want different settings for different instance types. For example, you don't want your sandbox or staging sites to be crawled, but you do want your production sites to be crawled, so this approach can cause issues when replicating code to production. Select this option only before a site goes live, to test the robots.txt file.
Save the robots.txt file by downloading the file or copying the contents to a text file and saving it as robots.txt. For information on where to upload your robots.txt file, see Uploading Your Robots.txt File.
If caching isn't enabled on your Staging site, any changes to the robots.txt file are immediately detected. However, if caching is enabled for your Staging instance, you must invalidate the Static Content Cache for a new or different robots.txt file to be generated or served. This requires permissions in the Administration module. The following instructions include information on invalidating the cache, though you might not need them if you don't have caching enabled.
If you already have entries in the robots.txt file, add the following directives at the bottom (a combined sketch follows the list).
# Search refinement URL parameters
Disallow: /*pmin*
Disallow: /*pmax*
Disallow: /*prefn1*
Disallow: /*prefn2*
Disallow: /*prefn3*
Disallow: /*prefn4*
Disallow: /*prefv1*
Disallow: /*prefv2*
Disallow: /*prefv3*
Disallow: /*prefv4*
Disallow: /*srule*
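As a sketch, assuming an existing file with a single User-agent: * record and one hypothetical Disallow entry, the new directives are appended at the bottom:

User-agent: *        # existing record (assumed for illustration)
Disallow: /checkout/ # existing hypothetical entry

# Search refinement URL parameters
Disallow: /*pmin*
Disallow: /*pmax*
# ...and the remaining prefn, prefv, and srule directives listed above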
Set the Googlebot crawl rate to Low through Google Webmaster Tools, because Google ignores the crawl-delay directive in robots.txt, as outlined in https://support.google.com/webmasters/answer/48620?hl=en.