Watermarking AI-Generated Text and Code
Exploring innovative algorithms for embedding metadata in AI-generated text and code.
Input: "Write a cat story"
Secret Key: af23b9c4d5e6f708192a3b4c5d6e7f809
Metadata: "Bobllama2"
Output: "Once upon a time, there lived a feline..."
[Metadata detected: "Bobllama2"]
Input: "Check if list has close numbers"
Secret Key: af23b9c4d5e6f708192a3b4c5d6e7f809
Metadata: "Bobllama2"
Output: "def close_pair(nums, thresh): ..."
[Metadata detected: "Bobllama2"]
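A multi-bit watermark carries its metadata as a bit string. The page does not say how InvisibleID serializes metadata, so the following is only a minimal sketch, assuming plain UTF-8 encoding; the function names `text_to_bits` and `bits_to_text` are hypothetical. Under that assumption, the example label "Bobllama2" would occupy 72 bits.

```python
def text_to_bits(text: str) -> str:
    """Serialize text as a string of '0'/'1' characters (UTF-8, MSB first)."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def bits_to_text(bits: str) -> str:
    """Inverse of text_to_bits; expects a whole number of bytes."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


payload = text_to_bits("Bobllama2")
print(len(payload))           # 72
print(bits_to_text(payload))  # Bobllama2
```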
About InvisibleID
Invisible Watermarking for AI-Generated Text & Code
InvisibleID develops invisible, multi-bit watermarking technology for AI-generated text and code, enabling organizations to verify provenance, enforce compliance, and protect intellectual property without degrading output quality.
As generative AI becomes widely used in enterprise workflows, it is increasingly difficult to audit AI-generated content, detect hallucinations, and protect critical document sections from unauthorized modification. InvisibleID addresses this gap by embedding verifiable metadata directly into AI outputs during generation, allowing authorized parties to later verify origin and integrity.
Our Team
- Massieh Kordi Boroujeny, LLM Watermarking
- Kai Zeng, Cybersecurity
- Ya Jiang, Research
Watermarking Algorithms
Provisional patents have been filed for the following watermarking algorithms.
Algorithm 1: StealthInk
Multi-bit undetectable watermarking that embeds multiple bits of information into AI-generated text. Achieves high detection accuracy with minimal perceptible changes to the generated content.
- Multi-bit information embedding
- Secret key-based detection
- High detection accuracy
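The page gives no implementation details for StealthInk, so the sketch below is not that algorithm. It is a generic toy illustration of the two ideas named in the list above: embedding several bits, and needing a secret key to read them back. Everything in it is an assumption made for illustration: the 32-word vocabulary, the HMAC-based split of the vocabulary into two halves, the rule that token i carries payload bit i modulo the payload length, and the function names. Unlike the algorithms described here, this toy restricts which tokens may be chosen, so it is neither distortion-free nor undetectable.

```python
import hashlib
import hmac
import random

VOCAB = [f"w{i}" for i in range(32)]  # toy vocabulary, purely illustrative


def _halves(key: bytes, position: int) -> tuple[list[str], list[str]]:
    """Split VOCAB into two equal halves using a keyed, per-position ordering."""
    def score(tok: str) -> bytes:
        msg = f"{position}:{tok}".encode()
        return hmac.new(key, msg, hashlib.sha256).digest()

    ranked = sorted(VOCAB, key=score)
    mid = len(ranked) // 2
    return ranked[:mid], ranked[mid:]


def embed(key: bytes, bits: str, n_tokens: int, rng: random.Random) -> list[str]:
    """Generate n_tokens toy tokens; token i carries payload bit i % len(bits)."""
    out = []
    for i in range(n_tokens):
        zero_half, one_half = _halves(key, i)
        allowed = one_half if bits[i % len(bits)] == "1" else zero_half
        out.append(rng.choice(allowed))  # stand-in for sampling from an LLM
    return out


def decode(key: bytes, tokens: list[str], n_bits: int) -> str:
    """Recover the payload by majority vote over the tokens carrying each bit."""
    votes = [0] * n_bits
    for i, tok in enumerate(tokens):
        _, one_half = _halves(key, i)
        votes[i % n_bits] += 1 if tok in one_half else -1
    return "".join("1" if v > 0 else "0" for v in votes)


key = b"demo-key"
message = "1101"
tokens = embed(key, message, n_tokens=40, rng=random.Random(0))
print(decode(key, tokens, n_bits=len(message)))  # 1101
```

Because the split depends on the key, a reader without the key cannot tell which half a token belongs to. The toy indexes bits by token position, so a single inserted or deleted token would misalign every later vote; robustness to edits needs a different design.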
Algorithm 2: DISC with Random Initialization
Distortion-free (DISC) watermarking technique with randomized initialization parameters to improve watermark robustness and reduce detection predictability. Uses variable initialization seeds to create more diverse watermarking patterns while maintaining the core DISC functionality.
- Randomized initialization
- Enhanced robustness
- Reduced predictability
- DISC-based architecture
Algorithm 3: DISC with Random Embedding
Distortion-free (DISC) watermarking technique that incorporates random embedding strategies to enhance watermark security and detection resistance. Implements stochastic embedding patterns while preserving the fundamental DISC detection capabilities.
- Random embedding patterns
- Enhanced security
- Detection resistance
- Stochastic patterns
Algorithm 4: MirrorMark
Token-level, distortion-free, probabilistic multi-bit watermarking algorithm that achieves higher capacity than DISC. It is highly resilient to edits, and the watermark survives paraphrasing of the text.
- Token-level distortion-free
- High capacity
- Resilient to edits & paraphrasing
Algorithm 5: Auxiliary-LLM-Agnostic
A bridge between steganography and watermarking that achieves much higher embedding capacity. It is distortion-free, provides publicly verifiable (keyless) watermarking, and uses a simple non-statistical decoder.
- High embedding capacity
- Distortion-free
- Publicly verifiable / Keyless
- Simple non-statistical decoder
Algorithm 6: Auxiliary-LLM-Aware
A variation using both auxiliary and proprietary LLMs to achieve near zero bit-error-rate. Using the auxiliary LLM gives this algorithm a very high embedding capacity.
- Uses Auxiliary & Proprietary LLMs
- Near zero bit-error-rate
- High embedding capacity
Algorithm 7: Fact/Idea Preserving
This watermarking algorithm can be built on any of the watermarks above. The watermark is tied to the factual content (the "meat") of the text, so if that content is changed, the watermark is removed. This makes it well suited to IP protection for firms in the legal and financial sectors.
- Tied to factual content
- Tamper-evident integrity
- Ideal for Law & Finance IP
Watermarking Examples
Text Watermarking
Example of non-watermarked and watermarked texts generated by LLAMA-2-7B with 24 bits embedded at 200 tokens
Prompt
Earthquake research has found that earthquakes are not evenly distributed across the earth, but are concentrated in certain specific strips. These strips where earthquakes are concentrated are called seismic belts.
Message (Payload) (24 bits)
111000110110110001010111
Non-Watermarked (Perplexity: 5.08)
The north coast of the Pacific Ocean and the east coast of Africa, these are the well-known zones or strips where earthquakes occur. This makes you wonder if earthquakes are random in occurrence. No, they are not. They occur due to the interaction of the plates and the stress they are exerting on the Earth.
Seismic Hazard Areas or Zones
After studying the geographical regions where earthquakes occur more frequently, it was found that these places are in areas with volcanoes or subduction zones. However, not all the volcanoes or subduction zones are exactly the same. Some are more active, and there are fewer earthquakes that occur in these areas. However, there are many volcanoes that are rarely seen or felt by the people in that area. So, what is the difference between these volcanoes? Are there some volcanoes that are more active than others?
Watermarked (Perplexity: 5.88)
In a seismic belt, one can observe a geographical zone where the earth's tectonic plates are interacting and forming tectonic boundaries. Also, these seismic belts or tectonic boundaries between the plates are usually found in areas where earthquakes occur. In these areas, there are usually complex subsurface structures such as volcanoes, subduction zones, and mountains. These structures and boundaries between plates generate stress that is observed by the people as earthquakes. There are many well-known seismic belts all over the world. However, most of the seismic belts or zones are found in places such as near the mid-ocean ridges. The mid-ocean ridges are located near the edges of the plates. Therefore, they contain seismic belts that separate plates from each other. The seismic belts are found in places such as
111000110111110001010111
Code Watermarking
Example of non-watermarked and watermarked code samples generated by Qwen2.5-Coder-7B using DISC algorithm. Dataset: OpenAI HumanEval.
Prompt
from typing import List
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
Message (Payload) (4 bits)
1101
Non-Watermarked
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True
    return False
Watermarked
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) <= threshold:
                return True
    return False
if __name__ == "__main__":
    import doctest
    doctest.testmod()
1101
Text Watermarking
Example of non-watermarked and watermarked texts generated by LLAMA-2-7B with 36 bits embedded at 200 tokens
Prompt
Earthquake research shows quakes cluster along specific belts rather than uniformly across the globe. These concentrations are called seismic belts.
Message (Payload) (36 bits)
110010101110111110110110110100000110
Non-Watermarked (Perplexity: 5.56)
The Pacific Ring of Fire, one of the most famous and densest seismic zones, covers the largest geographic area. The area between the Atlantic and Caribbean oceans is also densely populated with faults. The Pacific Ring of Fire is often compared to the "Atlantic" version. However, the Atlantic and Caribbean seismic belt is much more concentrated, a pattern not exhibited anywhere else on Earth. It is mostly on or just off the northwest coast of Newfoundland, Nova Scotia, and eastern New England, in the eastern half of Canada and the eastern and northern United States, with offshoots from the Bahamas to southern Florida/southeast United States. A smaller concentration crosses the southwest United States, north of Mexico, and up into the Baja peninsula.
Many see this intense concentration as a "mystery" or even a "sign". Others see these concentrations as
Watermarked (Perplexity: 7.61)
Quake magnitude 1.5 to 1.9
The belt usually doesn't appear on maps of earthquakes because it's too narrow to be shown. The Earth is about 4,000 miles across so belts have widths of less than 500 miles.
Areas with moderate levels of seismic activity where people don't normally feel any earthquakes are called plate margins.
Our planet is covered with huge plates made of solid rock. They travel slowly around the Earth. As one plate moves over another, the plates rub against each other. Friction between the moving plates makes earthquakes.
Earthquake Research
Scientists study quakes to learn about earthquake plate boundaries. Plate motion along the boundary often starts earthquakes.
Earthquake Research History
Science isn't usually very exciting to students, but earthquake science has some
110010101110111110110110110100000110
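The bit string printed under each watermarked sample appears to be the payload recovered by the detector; the page does not label it, so treat that reading as an assumption. Under it, recovery quality can be summarized as a bit error rate (BER), the fraction of payload bits that come back flipped. The sketch below compares the 24-bit and 36-bit payloads above with the strings shown under their watermarked samples; `bit_errors` is a hypothetical helper name.

```python
def bit_errors(sent: str, recovered: str) -> int:
    """Count positions where two equal-length bit strings differ."""
    if len(sent) != len(recovered):
        raise ValueError("bit strings must have the same length")
    return sum(a != b for a, b in zip(sent, recovered))


# payload length -> (embedded payload, bit string shown under the watermarked sample)
examples = {
    24: ("111000110110110001010111", "111000110111110001010111"),
    36: ("110010101110111110110110110100000110",
         "110010101110111110110110110100000110"),
}

for n_bits, (sent, recovered) in examples.items():
    errors = bit_errors(sent, recovered)
    print(f"{n_bits} bits: {errors} error(s), BER = {errors / n_bits:.3f}")
# 24 bits: 1 error(s), BER = 0.042
# 36 bits: 0 error(s), BER = 0.000
```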
Text Watermarking
Example of non-watermarked and watermarked texts generated by LLAMA-2-7B with 155 bits embedded at 200 tokens
Prompt
Earthquake research shows quakes cluster along specific belts rather than uniformly across the globe. These concentrations are called seismic belts.
Message (Payload) (155 bits)
00000111111010001011011110100101010110001011011010110101000110110111100000110011111111111000100010100101000100011010011100101011011110100111000010100001000
Non-Watermarked (Perplexity: 3.76)
The boundaries of these belts define where earthquakes strike most commonly, as does their orientation.
Key Takeaways: Earthquakes and Seismic Belts
Seismic belts are areas where earthquakes cluster together, usually forming straight lines on a map.
Seismic belts form due to movement of the Earth’s crust, volcanic activity, fault lines, and areas where continental plates meet.
There are 15 primary seismic belts around the world. The largest and most active seismic belt is the Ring of Fire, which circles the Pacific Ocean.
Seismic belts include major regions of the world, such as Europe and California.
Seismic belts are also responsible for some of the deadliest earthquakes that the world has ever seen.
We can measure an earthquake in several ways, but the most popular method is known as seismographs.
Watermarked (Perplexity: 7.93)
There are 15 major seismic belts around the world. Thirty of the largest and most powerful earthquakes have originated from one of the major seismic belts. The belt of the highest risk for the Pacific Coast is a spotted belt centered in California.
These belts can stretch up to a thousand kilometers long and hundred kilometers wide, and most contain an average of a magnitude seven or greater every year. The western coastal belt of the continental united states and Canada stretches for about 3,115 km and contains around a third of a total of about 1,310 major active faults in North America. This seismic zone runs right along a first magnitude line - a line across North America where the continent experiences its highest magnitude earthquakes of about 8-12. This is the boundary between tectonic plates in the Pacific Northwest. There is
10000011111010011011011010100101110110001011011110110111000110110011101000110011111111111000100010100101000100011111011100101010010110100111000000000001000
Text Watermarking
Example of non-watermarked and watermarked texts generated by LLAMA-2-7B with 180 bits embedded at 200 tokens
Prompt
Earthquake research shows quakes cluster along specific belts rather than uniformly across the globe. These concentrations are called seismic belts.
Message (Payload) (180 bits)
001110010010111010101100001000111101101000011101000111001011101001111001000001011110011011010100100100110011000001000101100111110000100000000010011001011000111110001100001010110010
Non-Watermarked (Perplexity: 5.03)
There are 15 major seismic belts around the world. Thirty of the largest and most powerful earthquakes have originated from one of the major seismic belts. The belts of the world of this world and the seismic belts are centered in California, the border of Iran and Pakistan, northwestern Turkey and Hellenic Sea, Yunnan, China, Tian Shan in China and Kazakhstan, South China Sea, Nicaragua, east of the Caribbean Sea, Java, south of Japan, Vanuatu and New Guinea, Solomon Sea, Kuril Islands in Japan, northeast of Japan in the Kuril Islands, Antarctica, Kamchatka Peninsula in Russia, and Aleutian Islands. The belts of the world are centered in the Pacific Ring of Fire, around the Mediterranean Sea and around the Great Rift Valley of Africa.
Sources: Drew.
Watermarked (Perplexity: 7.75)
There are 15 major seismic belts around the world. Thirty of the largest and most deadly quakes have been recorded here since the early twentieth century, as well as countless others that were minor.
The belts, locations, and reasons why each belt is seismically active is explained in these videos from National Geographic, and will no doubt help you to better understand the risk each belt poses.
"Powerful earthquake centers found around the Pacific Ocean Rim and in Alaska are just two of many active seismic zones around the world. When these plates break, the earth reverberates in a large, often destructive earthquake, which may trigger huge tsunamis and mud slides." I am not particularly talented as far as cooking is concerned, I have to be very honest about it.
The truth about the matter is, I think most of us have
001111010010110100101100001000111101101000001101000110001001101001110001000001011110011011010100100100110011000001001101100011010000110000010010011001011001111110001100001010110010
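The four text examples trade payload size against fluency. As a rough way to compare them, the sketch below computes payload bits per generated token and the ratio of watermarked to non-watermarked perplexity, using only the figures quoted in the examples above. It assumes each payload is spread over the full 200 tokens, and `summarize` is a hypothetical helper name. Each row reflects a single sample pair with different completions, so the ratios are indicative at best.

```python
def summarize(bits: int, tokens: int, ppl_plain: float, ppl_marked: float) -> tuple[float, float]:
    """Return (payload bits per token, watermarked / non-watermarked perplexity)."""
    return bits / tokens, ppl_marked / ppl_plain


# (payload bits, tokens, perplexity without watermark, perplexity with watermark)
samples = [
    (24, 200, 5.08, 5.88),
    (36, 200, 5.56, 7.61),
    (155, 200, 3.76, 7.93),
    (180, 200, 5.03, 7.75),
]

for bits, tokens, ppl_plain, ppl_marked in samples:
    rate, ratio = summarize(bits, tokens, ppl_plain, ppl_marked)
    print(f"{bits:>3} bits: {rate:.3f} bits/token, perplexity ratio {ratio:.2f}")
#  24 bits: 0.120 bits/token, perplexity ratio 1.16
#  36 bits: 0.180 bits/token, perplexity ratio 1.37
# 155 bits: 0.775 bits/token, perplexity ratio 2.11
# 180 bits: 0.900 bits/token, perplexity ratio 1.54
```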