Censys Platform Syntax Differences

Introduction

Censys introduced a new data model in the Censys Platform that includes three datasets: hosts, certificates, and web properties. Search functionality in the Platform has been enhanced, and both syntax and semantics have changed since Legacy Search. The Platform now uses a new query language that provides the following benefits:

  • Universal search allows users to easily search across multiple datasets with one query.
  • Search now includes a query validator that verifies syntax and assists with troubleshooting.
  • Regular expression (regex) searches are simplified, and the use of operators is improved.

Dataset representation in universal search

In Legacy Search, you had to indicate the dataset you were targeting for each query. With universal search in the Platform, you no longer have to do this.

You must modify queries and workflows used for Legacy Search to return similar results in the Platform.

In the new data model, the prefix host, cert, or web has been prepended to every parsed data field (web is a new dataset). The prefix indicates whether the field was detected on a host, certificate, or web property asset.

For example, in Legacy Search you would enter services.port: {21, 22, 23, 24, 25}.

In the Platform, you enter host.services.port: {21, 22, 23, 24, 25} to get the same results.
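
When migrating saved queries, the prefix change can be applied mechanically. The following is a minimal sketch, assuming a single `field: value` query with no boolean operators; the helper name add_dataset_prefix is hypothetical and not part of any Censys tooling. It only adds the prefix and does not account for fields that may have been renamed or restructured in the new data model.

```python
# Dataset prefixes described in the text above.
PLATFORM_PREFIXES = ("host.", "cert.", "web.")


def add_dataset_prefix(legacy_query: str, dataset: str = "host") -> str:
    """Prefix the field of a single `field: value` query with a dataset name.

    Hypothetical helper for illustration only: it handles one term and
    does not parse boolean operators, parentheses, or nested queries.
    """
    field, sep, value = legacy_query.partition(":")
    if not sep:
        raise ValueError("expected a query of the form 'field: value'")
    field = field.strip()
    if field.startswith(PLATFORM_PREFIXES):
        return legacy_query  # already written in Platform form
    return f"{dataset}.{field}:{value}"


print(add_dataset_prefix("services.port: {21, 22, 23, 24, 25}"))
# host.services.port: {21, 22, 23, 24, 25}
```

Review each converted query by hand before relying on it, since the Platform's query validator is the authority on whether the result is valid.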

Web Properties

Censys added a new Web Properties dataset in the Platform that consists of websites, APIs, and web-based apps. Virtual hosts are now represented by web properties in the Platform. Web properties provide a more accurate, current, and comprehensive view of name-based assets compared to virtual hosts.

Web properties also support data from multiple HTTP endpoints. This allows for deeper visibility into application paths, such as /wp-admin or /login, providing insights into a web asset's structure and functionality.

Web properties are identified by a hostname and port pair. Hostnames can be name-based records (such as app.censys.io) or IP-based records (such as 104.18.10.85). Example names of web property records include the following:

app.censys.io: 443

104.18.10.85: 8880
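
To make the hostname and port pairing concrete, the sketch below splits a web property name like the two above and reports whether the hostname is name-based or IP-based. The WebProperty type and the parse_web_property function are hypothetical names introduced for illustration; they are not part of a Censys SDK, and the "hostname: port" string form is taken only from the examples above.

```python
import ipaddress
from typing import NamedTuple


class WebProperty(NamedTuple):
    hostname: str
    port: int
    is_ip_based: bool


def parse_web_property(name: str) -> WebProperty:
    """Split a 'hostname: port' name into its parts (illustrative only)."""
    hostname, sep, port_text = name.rpartition(":")
    if not sep:
        raise ValueError("expected a name of the form 'hostname: port'")
    hostname = hostname.strip()
    port = int(port_text.strip())
    if not 0 < port <= 65535:
        raise ValueError(f"port out of range: {port}")
    try:
        # ip_address() raises ValueError for anything that is not an IP.
        ipaddress.ip_address(hostname)
        is_ip_based = True
    except ValueError:
        is_ip_based = False
    return WebProperty(hostname, port, is_ip_based)


for name in ("app.censys.io: 443", "104.18.10.85: 8880"):
    print(parse_web_property(name))
```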

In Legacy Search, hosts are identified by an IP address. Virtual hosts are identified by a name and an IP address. Internet assets that respond to hostname-based scans are now classified as web properties.

Syntax changes

| Syntax change | Explanation |
| --- | --- |
| Range operators | field: [* TO 10) has been replaced by the comparison operators (>, <, >=, <=) in the Censys Platform. |
| Wildcard characters outside of regex | * and ? can no longer be used as wildcards in values. Use regex instead. |
| Operator | (:*) Matches if the field contains any non-zero value. |
| Operator | (:) Use for a full-text search that uses tokenization (see below). |
| Regex | (=~) The query succeeds if the field's value matches the regex provided in the query. |
| Regex | Regex is unanchored in the Censys Platform, so a pattern can match anywhere within a field's value. To enforce an exact match, use ^ at the beginning and $ at the end. |
| Special characters | Special characters must be double-escaped with two backslashes. For example, \\w+ and \\. |
| Full text search | Complex queries in the Censys Platform use unquoted keywords and quoted multi-word values. The colon operator (:) now performs case-insensitive substring matching. Examples: my.field: foo and my.field: "foo bar faz" |
| Relative time | You can combine rounding with multiple comparison operators to target precise dates. Appending /[time variable] rounds to the nearest unit of time, such as minute, hour, day, or month. See Relative Time for more information. |
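
The difference between unanchored and anchored regex matching can be sketched with Python's re module, used here only as a stand-in; the Platform's regex engine may differ in other respects. The sample value and the matches helper are assumptions made for illustration. The escaping shown is Python raw-string escaping, not the Platform's double-backslash rule.

```python
import re

# Assumed sample field value, for illustration only.
value = "nginx/1.18.0 (Ubuntu)"


def matches(pattern: str, text: str) -> bool:
    """Unanchored match: the pattern may match anywhere in the text."""
    return re.search(pattern, text) is not None


print(matches(r"nginx", value))                         # True: found inside the value
print(matches(r"^nginx$", value))                       # False: the value has more text
print(matches(r"^nginx/1\.18\.0 \(Ubuntu\)$", value))   # True: whole value matches
```

An unanchored pattern behaves like a "contains" test, which is why ^ and $ are needed whenever the whole field value must match.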

Tokenization

The Platform uses tokenization to split text into searchable chunks. Instead of scanning the entire HTTP body as one large block of text, the Platform breaks it into smaller tokens, which increases search speed and precision.

The : operator now performs case-insensitive substring matching. This means that searches are not exact matches but instead match any document where the field contains the specified value, regardless of case.

Tokenization examples

The two examples below describe how tokenization works in the Censys Platform.

  • If you query web.endpoints.http.body: "click save", the Platform locates all services running HTTP and scans the body. It tokenizes "click" and "save" separately and ensures that they are in close proximity to each other in the body. Optimized tokenization allows Censys to handle large amounts of data, including HTML bodies, titles, and metadata, without slowing down performance.
  • Similarly, when a user searches for access=denied, the query undergoes the same tokenization process, which converts access=denied into access denied. When the Platform checks a document, it verifies whether the query's tokens appear in the correct order. A document that was originally written as access+denied is tokenized into access denied as well, so the search still matches.
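
The two examples above can be approximated with a simplified model. This sketch assumes that tokens are lowercase runs of letters and digits and that "close proximity" means the query tokens appear consecutively and in order; the Platform's actual tokenizer and proximity rules are not specified here and may differ. The function names tokenize and phrase_matches are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric tokens (simplified model)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_matches(query: str, document: str) -> bool:
    """Return True if the query's tokens appear consecutively, in order."""
    query_tokens = tokenize(query)
    doc_tokens = tokenize(document)
    if not query_tokens:
        return False
    n = len(query_tokens)
    # Slide a window of n tokens across the document and compare.
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


print(tokenize("access=denied"))   # ['access', 'denied']
print(tokenize("access+denied"))   # ['access', 'denied']
print(phrase_matches("access=denied", "Error: access+denied for user"))  # True
print(phrase_matches("click save", "Please CLICK Save to continue"))     # True
print(phrase_matches("click save", "save, then click"))                  # False
```

Because punctuation is discarded during tokenization in this model, access=denied and access+denied reduce to the same token sequence, which is why the second example matches.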

Optimized tokenization also enhances full-text and regex searches by allowing them to run faster and more efficiently across all fields.