Censys Platform Syntax Differences
Introduction
We introduced a new data model in the Censys Platform that includes three datasets: hosts, certificates, and web properties. Search functionality was also enhanced; both syntax and semantics have changed since Legacy Search. The Censys Platform now uses a new query language that provides the following benefits:
- Universal search allows users to easily search across multiple datasets with one query
- Search now includes a query validator that verifies syntax and assists with troubleshooting
- Simplified regex searches and improved operator handling
- Wildcard queries are no longer supported; use regex instead
Dataset representation in unified search
In Legacy Search, you selected a dataset from the dropdown in the search bar. With universal search, this step is no longer necessary. However, you must modify queries and workflows used in Legacy Search to return similar results in the Censys Platform.
In the new data model, `host`, `cert`, and `web` (new) have been prepended to all parsed data field names. These values indicate whether the field was detected on a host, certificate, or web property asset.
When you create a search, you must specify the dataset by prepending your query with `host.`, `cert.`, or `web.`.
For example, in Legacy Search you would enter `services.port: {21, 22, 23, 24, 25}`. In the Censys Platform, you enter `host.services.port: {21, 22, 23, 24, 25}` to get the same results.
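The prefixing rule above is mechanical, so a small script can sketch how a Legacy Search query might be migrated. This is a hypothetical helper, not a Censys tool; the list of field names is an assumption for illustration.

```python
import re

# Hypothetical: a few legacy field names to migrate (assumed for this example).
LEGACY_FIELDS = ["services.port", "services.service_name"]

def migrate(query, dataset="host"):
    """Prepend a dataset prefix to known legacy field names in a query string."""
    for field in LEGACY_FIELDS:
        # \b keeps us from matching inside a longer field name.
        query = re.sub(rf"\b{re.escape(field)}\b", f"{dataset}.{field}", query)
    return query

print(migrate("services.port: {21, 22, 23, 24, 25}"))
# → host.services.port: {21, 22, 23, 24, 25}
```

For certificate or web property queries, the same rewrite would use the `cert` or `web` prefix instead.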
Now that we’ve covered dataset modifications, let’s explore the syntax changes required for searching in Censys Platform.
Web Properties
We added a new Web Properties dataset in the Censys Platform that consists of websites, APIs, and web-based apps. Virtual hosts are now represented by web properties in the new platform. Web properties provide a more accurate, current, and comprehensive view of name-based assets compared to virtual hosts.
Web properties also support data from multiple HTTP endpoints. This allows for deeper visibility into application paths, such as /wp-admin or /login providing insights into a web asset's structure and functionality.
Web properties are identified by their origin: hostname and port. Hostnames can be name-based records (such as app.censys.io) or IP-based records (such as 104.18.10.85). Example names of web property records include the following:
- app.censys.io:443
- 104.18.10.85:8880
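Because a web property name is simply `hostname:port`, splitting one back into its components is straightforward. A minimal sketch, assuming names shaped like the examples above:

```python
def parse_origin(origin):
    """Split a web property name like 'app.censys.io:443' into hostname and port."""
    # rpartition splits on the last colon, so IP-based hostnames work too.
    hostname, _, port = origin.rpartition(":")
    return hostname, int(port)

print(parse_origin("app.censys.io:443"))   # ('app.censys.io', 443)
print(parse_origin("104.18.10.85:8880"))   # ('104.18.10.85', 8880)
```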
In Legacy Search, hosts are identified by an IP address. Virtual hosts are identified by a name and an IP address. Internet assets that respond to hostname-based scans are now classified as web properties.
Syntax changes
Syntax change | Explanation |
---|---|
Range operators | `field:` |
Wildcard queries | Wildcard queries are no longer supported; use regex instead. |
Operators | |
Regex | Regex is unanchored in the Censys Platform, so expressions can match anywhere within a field's value. To enforce an exact match, anchor the expression (for example, with `^` and `$`). |
Special characters | Special characters must be double-escaped with two backslashes. |
Full text search | Advanced queries in the Censys Platform use unquoted keywords and quoted multi-word values. |
Relative time | You can use rounding and multiple comparison operators to be very specific about which dates you want to target. |
See Relative Time for more information.
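The anchored-versus-unanchored distinction in the table can be illustrated with Python's `re` module; this only demonstrates the general regex concept, not the Censys matching engine itself, and the sample banner text is made up.

```python
import re

body = "Welcome to Example Server v2.1"  # hypothetical field value

# Unanchored: the pattern may match anywhere within the value.
print(bool(re.search(r"server", body, re.IGNORECASE)))      # True

# Anchored to the full value: a partial pattern no longer matches.
print(bool(re.fullmatch(r"server", body, re.IGNORECASE)))   # False

# The full value must be described to get an exact match.
print(bool(re.fullmatch(r"welcome to example server v2\.1", body, re.IGNORECASE)))  # True
```

In a query language with unanchored regex, a short pattern can therefore match far more records than an exact-match query would.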
Tokenization
The Censys Platform optimizes tokenization to split text into searchable chunks. Instead of scanning the entire HTTP body as one large block of text, the platform breaks it into smaller tokens, increasing speed and precision.
For example, if you query `web.http.body: "click save"`, our platform locates all services running HTTP and scans the body. It tokenizes `"click"` and `"save"` separately and ensures that they are in close proximity to each other in the body. Optimized tokenization allows Censys to handle large amounts of data, including HTML bodies, titles, and metadata, without slowing down performance.
Tokenization example
A document contains the text:
"User admin attempted login. Access=Denied. Error code: 401."
The Platform processes this text by removing whitespace and special characters, then breaks it into tokens:
user admin attempted login access denied error code 401
Similarly, when a user searches for `access=denied`, the query undergoes the same tokenization process, converting `access=denied` into `access denied`. When the Platform checks the document, it verifies whether the query's tokens appear in the correct order. Since the document contains access denied (even though it was originally written as `Access=Denied`), the search still matches.
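The process above can be sketched in a few lines. This is a simplified model of tokenization and ordered-token matching, assuming a basic split on non-alphanumeric characters; the real platform's tokenizer is more sophisticated.

```python
import re

def tokenize(text):
    # Lowercase, then split on runs of non-alphanumeric characters,
    # which strips whitespace, punctuation, and symbols like '='.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def matches(document, query):
    """Check whether the query's tokens appear, in order, in the document."""
    doc_tokens = tokenize(document)
    query_tokens = tokenize(query)
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))

doc = "User admin attempted login. Access=Denied. Error code: 401."
print(tokenize(doc))
# Both query forms reduce to the tokens ['access', 'denied'], so both match:
print(matches(doc, "access=denied"))  # True
print(matches(doc, "access denied"))  # True
```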
Optimized tokenization also enhances full-text and regex searches by allowing them to run faster and more efficiently across all fields.