Disallowing ChatGPT Using Robots.txt

Author

Chris Young

AI tools like ChatGPT, while useful, can inadvertently bypass content attribution, raising concerns about intellectual property. This article offers instructions to protect proprietary content using "robots.txt".

ChatGPT and other natural-language-processing (NLP) AI tools have, seemingly overnight, become a central part of everyday life. While this is good for people who need quick access to information in a conversational format, it can be damaging to content producers wary of plagiarism or intellectual property theft. To gain its capabilities, ChatGPT was trained on a massive dataset drawn from a wide range of websites and information sources, much of which would otherwise be proprietary. When human writers quote proprietary information, ethics require them to cite their sources and attribute credit. ChatGPT is under no such compulsion and presents the information it provides without crediting the source. This is where the problems appear: the authors and publishers of thought leadership and innovative new content don't get the credit they deserve. Fortunately, there is an easy way to prevent this from happening to your company's content.

The Basics of ChatGPT

ChatGPT is a language-based artificial intelligence model developed by OpenAI on the GPT (Generative Pre-trained Transformer) architecture. The technology rose to prominence in late 2022 and has been improving its knowledge, eloquence, and abilities since then. Its primary capabilities include natural language understanding, generation, and the ability to answer questions and engage in conversation with users. GPT's knowledge was derived from the expansive datasets it was exposed to, encompassing vast portions of the internet. These varied datasets enable GPT to discuss a wide range of topics, from general conversation to theological and philosophical questions. However, these datasets are not updated in real time: the training data are selected in advance, and only later versions of the model are updated with exposure to newer information.

Data Sets

ChatGPT's knowledge base is constructed from the data it was trained on. This means its responses are influenced by the information present in those datasets, making it knowledgeable on a vast array of subjects up to its last training cut-off in 2022.

Proprietary Information

ChatGPT does not have the ability to access subscription sites, confidential information, or proprietary databases. Any information it provides is based on publicly available data up to its last training date. However, this publicly available information, such as newspaper articles and blog posts, still contains ideas original to other people. ChatGPT cannot cite the exact sources it draws on, which can lead to misrepresentation, missing context, and, perhaps most importantly, a complete absence of attribution.

The Power and Purpose of Robots.txt

“Robots.txt” is a standard used by websites to instruct web crawlers, spiders, and other web “robots” about which pages or files they should or shouldn't request from the site. Located in the root directory of a website (typically "website.com/robots.txt"), this simple text file provides guidelines about which parts of the site should remain private or stay out of search engine indexes.

Its role is crucial in safeguarding specific content from automatic processing by ethical web crawlers. Website owners might have sections of their site that are private, under development, or not intended for public view. By specifying disallowance directives in the “robots.txt” file, they can request that web crawlers not access and index those sections. For example, a line like "Disallow: /private/" instructs web robots not to access or index anything under the "private" directory. 
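For illustration, a complete robots.txt can be as small as a few lines; the directory names below are purely hypothetical placeholders:

User-agent: *
Disallow: /private/
Disallow: /drafts/

The "User-agent: *" line applies the rules to every crawler that reads the file, and each "Disallow" line names a path that compliant bots are asked to skip.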

However, it's essential to note that “robots.txt” is a voluntary mechanism. Ethical web crawlers, such as those used by major search engines, will respect the directives, but it's technically possible for malicious bots or scrapers to ignore the file and access the content anyway. As such, for highly sensitive data, more stringent security measures than just a "robots.txt" file are necessary.

Why Disallow ChatGPT?

The main reason websites may want to disallow ChatGPT from using their content as training data lies in the nature of content marketing. Content marketing relies on the idea that a company must first provide something of value, usually knowledge or entertainment, before asking anything of the customer. In turn, the company earns the customer's trust and respect. This usually happens through articles, white papers, eBooks, and other proprietary pieces that are published for consumption by the public at large.

There would be no issue here if ChatGPT actually cited its sources, but this is not a feature that OpenAI (a name Elon Musk has pointedly suggested should be changed to "ClosedAI") offers. If it did, doing so would disclose the training dataset and allow other organizations to essentially reproduce ChatGPT. This damages content marketing efforts by failing to credit businesses for the content they produce, and it ultimately disincentivizes investing in thought leadership and other content. If companies stop producing that content, consumers in turn lose access to the information it carried, a net negative for everyone.

Beyond intellectual property, there is some concern about server resources. Each interaction a web crawler makes with a website entails a server request. While a single request from a bot might seem inconsequential, the cumulative effect of numerous simultaneous or rapid interactions can be significant. Excessive requests can bog down a server, leading to reduced website performance, heightened server maintenance costs, and, in some extreme cases, outright crashes. This can not only inconvenience genuine users but also inflate operational costs for site administrators.

Step-by-step Guide to Disallowing ChatGPT using Robots.txt

Step 1

To disallow ChatGPT from scraping data from your website, begin by accessing your "robots.txt" file. For those using traditional web hosting setups, log in to your web hosting control panel or use an FTP client. Once logged in, navigate to your website's root directory, which is typically named "public_html", "www", or "htdocs". Here, look for a file named "robots.txt". If this file exists, you can edit it directly; if it's absent, you'll need to create a new one. Alternatively, if you operate your website through a Content Management System like WordPress, you may have tools or plugins that let you edit the "robots.txt" directly from your admin dashboard. Consult your CMS's documentation or plugin repository for specifics on how to do this.
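Before logging in anywhere, you can also check whether a robots.txt already exists simply by requesting it. A minimal Python sketch, assuming your site lives at www.example.com (substitute your own domain):

import urllib.error
import urllib.request

url = "https://www.example.com/robots.txt"  # replace with your own domain
try:
    with urllib.request.urlopen(url) as response:
        # The file exists; print the rules currently in place
        print(response.read().decode("utf-8"))
except urllib.error.HTTPError as error:
    print(f"No robots.txt found (HTTP {error.code}); you will need to create one")

A 404 response here simply means the file does not exist yet, and you can create one in the root directory as described above.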

Step 2

Having accessed or created the "robots.txt", the next step is to instruct OpenAI's crawlers to abstain from scanning your content. Note that these crawlers identify themselves as "GPTBot" (the bot that gathers training data) and "ChatGPT-User" (the agent ChatGPT uses when browsing on a user's behalf), so those are the names your directives should target. To block both from the entire site, append the following lines to your file:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

Here, each "User-agent" line targets a specific crawler. The "Disallow" directive that follows, with a lone slash, prohibits that crawler from the entire website. If your intention is merely to restrict a specific directory, replace the slash with the directory's path, such as "/directory-name/".

Step 3

After adding these directives, it's important to verify that they're functioning as intended. One of the most reliable ways to do this is with the robots.txt testing tools in Google Search Console; run a test to confirm that your new rules are being read correctly.
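If you prefer to verify the rules programmatically, Python's standard library can parse a live robots.txt and report whether a given user agent is allowed to fetch a given URL. A minimal sketch, again assuming your site is at www.example.com:

import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")  # replace with your own domain
parser.read()  # downloads and parses the live file

# After the changes above, GPTBot should be denied while normal crawlers are unaffected
print(parser.can_fetch("GPTBot", "https://www.example.com/"))     # expected: False
print(parser.can_fetch("Googlebot", "https://www.example.com/"))  # expected: True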

Potential Challenges & Solutions

There are several challenges that companies must also deal with when disallowing ChatGPT and other NLP AI. While ChatGPT is the best-known program, it is certainly not the only one: it exists in a constellation of other AI crawlers that may scrape your website even if ChatGPT is disallowed. Cut the head off one AI program and ten more are waiting to take its place. This "hydra" problem is, for now, simply a fact of life that website owners must get accustomed to.

Other risks come with editing a robots.txt file: a careless change can affect web crawlers that you do want crawling your website. These bots, especially ones from major search engines like Google or Bing, play a pivotal role in indexing a website and determining its position in search results. Accidentally blocking them could have detrimental effects on a website's Search Engine Optimization (SEO), rendering the site virtually invisible in search results and leading to decreased organic traffic. In some instances, blocking certain bots can also impact site functionality, especially if the website relies on external services that use crawlers for data updating or synchronization.
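To make that risk concrete, the difference between blocking a single AI crawler and blocking every compliant crawler is one line. A sketch of the mistake to avoid:

# Blocks only OpenAI's training crawler:
User-agent: GPTBot
Disallow: /

# Blocks every compliant crawler, including Googlebot and Bingbot:
User-agent: *
Disallow: /

Always double-check that any "Disallow: /" rule sits under a specific user agent rather than under the wildcard, and re-test the file (as in Step 3) after every change.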

There are, however, a few paths forward and at least one consolation. Many people who have heard of ChatGPT still treat it as a novelty: they play around with it for a few minutes, find it interesting, and then rarely return to it. In other words, they continue to rely on sources like Google or DuckDuckGo to find information, and those search engines are willing to credit your website and send readers there to learn more.

Anyone who has spent a meaningful amount of time with ChatGPT will have noticed that it routinely produces bad information. After a dubious response, all you have to do is ask, "Are you sure about that?" and it will frequently apologize and retract or rewrite its statement. The main advantage that traditional search engines have is their democratization of information: their algorithms are built to surface the content that best matches a given query and then let the searcher decide what fulfills their needs or answers their question. ChatGPT is fundamentally different in that no such algorithm prioritizes better information or deprioritizes bad information.

Perhaps one of the best ways to overcome this deficit in attribution is social and legal pressure on the AI companies in question. There is a real argument to be made that their activities fall outside of "fair use" and cross the boundary into plagiarism. Pressuring these companies into citing sources will not only make the internet experience better for everyone, but will also give businesses more incentive to continue producing content that helps their customer base.

Conclusion

Natural language processing AI is a fact of life and comes with advantages as well as disadvantages. ChatGPT can help disseminate general knowledge and educate more people. However, disallowing ChatGPT using robots.txt is a step that every company should take to protect its published intellectual property and marketing efforts. If you're unsure of how exactly to implement the robots.txt changes, or if you need continued help with content creation and marketing, consider contacting Be More Digital. Our team of experts has decades of experience working across a wide array of different industries. Our multi-pronged marketing approaches are extremely effective and can be adjusted to several different business models. Get in touch with Be More Digital today for a comprehensive proposal and consultation.

Custom Digital Marketing Solution

We understand that every business is different and may need a combination of solutions or only a small subset of them. Get in touch with our team today to see how Be More Digital can help you reach your goals.

Contact Us
