Censys Platform Syntax Differences
Introduction
Censys introduced a new data model in the Censys Platform that includes three datasets: hosts, certificates, and web properties. Search functionality in the Platform has also been enhanced, and both syntax and semantics have changed since Legacy Search. The Platform now uses a new query language that provides the following benefits:
- Universal search allows users to easily search across multiple datasets with one query.
- Search now includes a query validator that verifies syntax and assists with troubleshooting.
- Regular expression (regex) searches are simplified, and the use of operators is improved.
Dataset representation in unified search
In Legacy Search, you had to indicate the dataset you were targeting for each query. With universal search in the Platform, you no longer have to do this.
You must modify queries and workflows used for Legacy Search to return similar results in the Platform.
In the new dataset, host, cert, and web (new) have been prepended to all parsed data fields. These values indicate whether the field was detected on a host, certificate, or web property asset.
For example, in Legacy Search you would enter services.port: {21, 22, 23, 24, 25}. In the Platform, you enter host.services.port: {21, 22, 23, 24, 25} to get the same results.
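The prefix change can be sketched as a small migration helper. This is a minimal illustration, not a Censys tool: the function name to_platform_query is hypothetical, it assumes the legacy query only uses fields rooted at services., and it applies none of the other syntax changes described on this page.

```python
import re

# Hypothetical helper, not part of any Censys library.
# Assumption: only legacy fields rooted at "services." need the prefix.
# The lookbehind skips fields that already carry a dataset prefix.
_LEGACY_FIELD = re.compile(r"(?<![\w.])(services\.[\w.]+)(?=\s*:)")


def to_platform_query(legacy_query: str, dataset: str = "host") -> str:
    """Return the query with `dataset.` prepended to each services.* field."""
    return _LEGACY_FIELD.sub(lambda m: f"{dataset}.{m.group(1)}", legacy_query)


print(to_platform_query("services.port: {21, 22, 23, 24, 25}"))
# prints: host.services.port: {21, 22, 23, 24, 25}
```

A query that already uses the host. prefix passes through unchanged, so the helper can be run more than once on the same query.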
Web Properties
Censys added a new Web Properties dataset in the Platform that consists of websites, APIs, and web-based apps. Virtual hosts are now represented by web properties in the new platform. Web properties provide a more accurate, current, and comprehensive view of name-based assets compared to virtual hosts.
Web properties also support data from multiple HTTP endpoints. This allows for deeper visibility into application paths, such as /wp-admin or /login, providing insights into a web asset's structure and functionality.
Web properties are identified by a hostname and port pair. Hostnames can be name-based records (such as app.censys.io) or IP-based records (such as 104.18.10.85). Example names of web property records include the following:
app.censys.io: 443
104.18.10.85: 8880
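To make the hostname and port pairing concrete, the sketch below splits a record name into its two parts and flags whether the hostname is IP-based. The names WebPropertyName and parse_web_property are hypothetical, and the code assumes record names look like the examples above, with optional whitespace after the colon.

```python
import ipaddress
from typing import NamedTuple


class WebPropertyName(NamedTuple):
    hostname: str
    port: int
    ip_based: bool  # True when the hostname is an IP address


def parse_web_property(name: str) -> WebPropertyName:
    """Split a 'hostname:port' record name into its parts."""
    # Split on the last colon; the port is whatever follows it.
    hostname, sep, port = name.rpartition(":")
    if not sep:
        raise ValueError(f"expected hostname:port, got {name!r}")
    hostname = hostname.strip()
    try:
        ipaddress.ip_address(hostname)
        ip_based = True
    except ValueError:
        # Not a valid IP address, so treat it as a name-based record.
        ip_based = False
    return WebPropertyName(hostname, int(port), ip_based)


print(parse_web_property("app.censys.io: 443"))
print(parse_web_property("104.18.10.85: 8880"))
```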
In Legacy Search, hosts are identified by an IP address. Virtual hosts are identified by a name and an IP address. Internet assets that respond to hostname-based scans are now classified as web properties.
Syntax changes
Syntax change | Explanation |
---|---|
Range operators | field: |
Wildcard characters outside of regex |
|
Operator |
|
Operator |
|
Regex |
|
Regex | Regex is unanchored in the Censys Platform, so a pattern can match anywhere within a field's value. To enforce an exact match, use |
Special characters | Special characters must be double-escaped with two backslashes. For example, |
Full text search | Complex queries in the Censys Platform use unquoted keywords and quoted multi-word values. The colon operator |
Relative time | You can use rounding and multiple comparison operators to be very specific about what dates you want to target. Using |
See Relative Time for more information.
Tokenization
The Platform uses tokenization to split text into searchable chunks. Instead of scanning the entire HTTP body as one large block of text, the Platform breaks it into smaller tokens, which improves both search speed and precision.
The : operator now performs case-insensitive substring matching. This means that searches are not exact matches but instead match any document where the field contains the specified value, regardless of case.
Tokenization examples
The two examples below describe how tokenization works in the Censys Platform.
- If you query web.endpoints.http.body: "click save", our Platform locates all services running HTTP and scans the body. It tokenizes "click" and "save" separately and ensures that they are in close proximity to each other in the body. Optimized tokenization allows Censys to handle large amounts of data, including HTML bodies, titles, and metadata, without slowing down performance.
- Similarly, when a user searches for access=denied, the query undergoes the same tokenization process, converting access=denied into access denied. When the Platform checks the document, it verifies whether the query's tokens appear in the correct order. Since the document contains access denied (even though it was originally written as access+denied), the search still matches.
Optimized tokenization also enhances full-text and regex searches by allowing them to run faster and more efficiently across all fields.
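The matching behavior in the tokenization examples can be approximated with a toy model. This is a sketch under stated assumptions, not the Platform's implementation: the actual tokenizer rules are not documented here, so the code assumes tokens are runs of letters and digits, lowercases everything, and simplifies "close proximity" to strict adjacency. The function names tokenize and phrase_matches are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it on any non-alphanumeric character."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_matches(query: str, document: str) -> bool:
    """Return True if the query tokens occur consecutively, in order."""
    q, d = tokenize(query), tokenize(document)
    if not q:
        return False
    # Slide a window of len(q) tokens across the document tokens.
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


# "=" and "+" are both token separators, so these reduce to the same tokens.
print(phrase_matches("access=denied", "Error: access+denied for user"))  # True
# Token order matters: "save" before "click" does not match "click save".
print(phrase_matches("click save", "save, then click"))  # False
```

Because punctuation is discarded before comparison, the query and the document only need to agree on the sequence of tokens, which is why access=denied still matches text written as access+denied.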