Dataset Differences: Censys Platform vs. Legacy Search

Censys introduced a new data model for describing hosts, certificates, and web properties on the Internet in the Censys Platform. This new data model is different from the one available in Legacy Search and features many improvements and additions. Queries and workflows that were used in Legacy Search need to be modified to return similar results in the Censys Platform.

Some of the most prominent changes include:

  • A new domain-specific language and syntax, Censys Query Language (CenQL), for querying the Censys Platform datasets.
  • Virtual hosts are now represented by web properties to describe and organize data for Internet assets that respond to name-based scans.
  • Additional context about web asset software, hardware, operating system, threats, and more to provide actionable information on what exactly mapped services and devices are, how they are configured, whether they are vulnerable, whether they are malicious, and where they live.
  • All parsed data fields are now prefixed with host, cert, or web. The prefix indicates whether the field is present on a host, certificate, or web property asset.
  • Many values related to software or hardware that were previously present in the label field in Legacy Search can now be found instead in software and software.components fields for hosts and web properties.
    • Labels in the Censys Platform are primarily used to categorically describe services and hosts. Example label values include things like LOGIN_PAGE, OPEN_DIRECTORY, and CAMERA.
    • Labels in the Censys Platform datasets can be found in the host.labels.value, host.services.labels.value, and web.labels.value fields.

The sections below explore the changes mentioned above in greater detail.

The certificate data model available in the Censys Platform did not significantly change from that available in Legacy Search. However, all certificate-related fields are now prepended with cert.
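
As a rough illustration of the prefix convention, the sketch below maps a fully qualified Platform field name to the asset type it belongs to. The function name `asset_type` and the returned descriptions are made up for this example; only the host, cert, and web prefixes come from the text above.

```python
# Prefixes described above; the human-readable names are illustrative.
ASSET_PREFIXES = {
    "host": "host",
    "cert": "certificate",
    "web": "web property",
}


def asset_type(field: str) -> str:
    """Return the asset type implied by a Platform field's leading prefix."""
    prefix, _, rest = field.partition(".")
    if prefix not in ASSET_PREFIXES or not rest:
        raise ValueError(f"not a prefixed Platform field: {field!r}")
    return ASSET_PREFIXES[prefix]


print(asset_type("host.labels.value"))
print(asset_type("web.labels.value"))
```

A Legacy Search name without a prefix, such as services.service_name, raises `ValueError` here, which can help spot queries that still use old field names.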

A complete list of all data fields available in the Censys Platform is accessible from within the Censys Platform web interface.

Deprecation of virtual hosts and introduction of web properties

The Legacy Search dataset includes hosts and virtual hosts. In Legacy Search, hosts are identified by an IP address. Virtual hosts are identified by a name and an IP address.

Virtual hosts are not present in the Platform dataset. Instead, internet assets that respond to hostname-based scans are classified as web properties.

In the Platform, web properties are identified by a hostname and a port. Hostnames can be name-based records (such as app.censys.io) or IP-based records (such as 104.18.10.85). Example names of web property records include the following:

  • app.censys.io: 443
  • 104.18.10.85: 8880
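
The naming scheme above can be sketched in code. The example below splits a web property name into hostname and port and checks whether the hostname is IP-based. `WebPropertyName` and `parse_web_property_name` are hypothetical helpers for illustration, not part of any Censys library, and the parser only targets the two forms shown above (bracketed IPv6 literals are not handled).

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class WebPropertyName:
    hostname: str
    port: int
    is_ip_based: bool


def parse_web_property_name(name: str) -> WebPropertyName:
    """Split 'hostname:port' (optional space after the colon) into its parts."""
    hostname, sep, port_text = name.rpartition(":")
    if not sep or not hostname:
        raise ValueError(f"expected 'hostname:port', got {name!r}")
    hostname = hostname.strip()
    port = int(port_text.strip())
    try:
        # ip_address() raises ValueError for anything that is not an IP literal.
        ipaddress.ip_address(hostname)
        is_ip_based = True
    except ValueError:
        is_ip_based = False
    return WebPropertyName(hostname, port, is_ip_based)


print(parse_web_property_name("app.censys.io: 443"))
print(parse_web_property_name("104.18.10.85: 8880"))
```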

Web properties provide a more accurate, up-to-date, and comprehensive view of name-based assets than virtual hosts. Classifying and presenting web assets as web properties enables Censys data to:

  • Better identify and showcase name-based assets, linking domain names and related services directly to their underlying infrastructure.
  • Support data from multiple HTTP endpoints.
  • Make searching for name-based assets feel like using a web browser.
  • Dramatically improve the freshness of web data. Web properties are refreshed daily.
  • Provide deeper visibility into application scans and the true footprint of internet assets, enhancing precision in asset discovery and threat monitoring.

How web properties differ from host services

Web properties offer insight into HTTP services beyond layer 7 while abstracting away HTTP protocol semantics.

  • Web properties contain all records that correlate to HTTP-based scans.
  • A web property can have one or more endpoints, which serve as distinct entry points to different resources or functionalities within the web property.
    • Each endpoint may be associated with specific applications, services, or APIs and can be individually monitored and analyzed.
  • Web properties support deep scan information for HTTP-based scanners.

When to search across web properties instead of host services

Search web properties when:

  • You want results that include hostnames.
  • You are targeting software that runs on top of HTTP, such as WordPress, pprof, Kubernetes, Elasticsearch, and so on.
  • You are targeting software that services HTTP, like Apache or nginx.
  • You need HTTP body information.
  • You need data from endpoints other than /.

Do not use web properties when:

  • You want results that include IP addresses.
  • You are searching for DNS data, whois data, geolocation data, or routing data.
  • You are searching for hosts serving HTTP as well as non-HTTP protocols.
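
The two lists above amount to a small decision rule, sketched below. The `Need` enum and `suggest_datasets` function are invented for this example, and treating "hosts" as the alternative to web properties is an assumption drawn from the heading of this section.

```python
from enum import Enum


class Need(Enum):
    HOSTNAMES = "results that include hostnames"
    HTTP_SOFTWARE = "software running on top of, or serving, HTTP"
    HTTP_BODY = "HTTP body information"
    NON_ROOT_ENDPOINTS = "endpoints other than /"
    IP_ADDRESSES = "results that include IP addresses"
    HOST_ENRICHMENT = "DNS, whois, geolocation, or routing data"
    NON_HTTP_PROTOCOLS = "hosts serving HTTP as well as non-HTTP protocols"


# Needs from the "search web properties when" list.
WEB_NEEDS = {Need.HOSTNAMES, Need.HTTP_SOFTWARE, Need.HTTP_BODY, Need.NON_ROOT_ENDPOINTS}
# Needs from the "do not use web properties when" list.
HOST_NEEDS = {Need.IP_ADDRESSES, Need.HOST_ENRICHMENT, Need.NON_HTTP_PROTOCOLS}


def suggest_datasets(needs: set[Need]) -> set[str]:
    """Return the dataset(s) that the lists above point to for these needs."""
    datasets = set()
    if needs & WEB_NEEDS:
        datasets.add("web properties")
    if needs & HOST_NEEDS:
        datasets.add("hosts")
    return datasets


print(suggest_datasets({Need.HTTP_BODY, Need.NON_ROOT_ENDPOINTS}))
```

When the needs span both lists, the function returns both datasets, which suggests running separate searches rather than forcing one dataset to answer everything.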

Previous limitations with virtual hosts in Legacy Search data model

Censys has historically scanned hosts with HTTP services running on them. There were several limitations with this approach:

  • In the Legacy Search data model, scanning hosts with HTTP services typically targets a single HTTP endpoint per host. By default, Censys aims to scan the root path (/) on HTTP and HTTPS ports (e.g., 80, 443) for each host. This approach makes it challenging to extract information from multiple endpoints like /wp-admin or /login.
    • Additionally, endpoints beyond the root may contain dynamic or context-specific content that varies based on parameters like session states, cookies, or user-agent headers. Scanning multiple or application-specific paths requires explicit scan configuration, reducing the ability to explore deeper or secondary paths systematically, which could result in missing critical information hosted on these endpoints.
  • If a host had an HTTP service running on 443 but Censys could also identify a Cobalt Strike application running on that HTTP service, the Legacy Search data model couldn’t support showing both services, which run on 443. This led to difficult decisions about prioritizing scan results which didn’t feel aligned with our mission.
  • Some users struggle to understand when and whether to include or exclude virtual hosts as targets of their search queries.
  • Data on virtual hosts could be up to 45 days stale, making them less useful.

Modified data field names

In the Platform dataset, some field names have changed, moved, or been removed. The following are examples of popular fields that moved between Legacy Search and the Platform. A complete list of fields that have moved or changed is available here.

| Legacy Search field | Platform field |
| --- | --- |
| services.service_name | host.services.protocol |
| services.http.response.body | host.services.endpoints.http.body |

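
When updating saved queries or scripts, renames like these can be applied mechanically. The sketch below is a hypothetical helper that swaps the two Legacy Search field names listed above for their Platform equivalents wherever they appear as whole names in a string. It only renames fields; it makes no attempt to translate query syntax into CenQL, and the mapping would need to be extended from the complete list of changed fields.

```python
import re

# Only the two renames listed above; extend from the complete list.
LEGACY_TO_PLATFORM = {
    "services.service_name": "host.services.protocol",
    "services.http.response.body": "host.services.endpoints.http.body",
}

# Match whole field names only: not preceded or followed by a word
# character or a dot, so longer or already-prefixed names are left alone.
_FIELD_PATTERN = re.compile(
    r"(?<![\w.])("
    + "|".join(re.escape(f) for f in sorted(LEGACY_TO_PLATFORM, key=len, reverse=True))
    + r")(?![\w.])"
)


def rename_fields(text: str) -> str:
    """Replace known Legacy Search field names in text with Platform names."""
    return _FIELD_PATTERN.sub(lambda m: LEGACY_TO_PLATFORM[m.group(1)], text)


print(rename_fields("services.service_name"))
print(rename_fields("services.http.response.body"))
```
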
Context on hosts and web properties

The data model featured in the Censys Platform surfaces new, actionable context on what exactly mapped services and devices are, how they are configured, whether they are vulnerable, whether they are malicious, and where they live.

This new context allows users to:

  • Quickly find a specific type of host, web property, or service.
  • Take advantage of uniform metadata fields like CPEs, software, hardware, and labels.
  • Quickly and easily conduct investigations and incorporate Censys data into automation workflows.

The following new context types are available on hosts and web property records. Not all context will be available to all users.

  • Hardware
  • Software
  • Operating System
  • Threats
  • Vulnerabilities
  • Exposures
  • Labels
  • Misconfigurations

Hardware

By extracting hardware into its own object, the Platform dataset is able to provide details about the hardware itself (such as a Juniper Router) while also listing any known information about the hardware’s components, such as the processor type and firmware. This provides a structure to give more context about the kinds of devices that may be vulnerable.

Software

By extracting software into its own object, Censys data captures specific details about the software itself (such as Microsoft SQL Server) while also including any relevant information about its components, like version numbers on hosts and web properties.

This structured approach allows the Platform to offer more context on software configurations and versions that may present vulnerabilities or security concerns, as well as highlight associations with other software or hardware dependencies, helping users understand the potential exposure or compatibility across diverse systems.

| Field | Description |
| --- | --- |
| .services.software.components.vendor | Vendor or organization responsible for creating or maintaining the software component |
| .services.software.components.cpe | Common Platform Enumeration (CPE) identifier for the software component |
| .services.software.components.version | Version number of the software component |
| .services.software.components.part | Specifies the type of software component such as web-server, proxy-server, botnet-server, and so on |
| .services.software.components.product | Product name of the software component |
| .services.software.components.update | Describes the update version for the software component, such as major, minor, or patch |
| .services.software.components.edition | Specifies the edition of the software component, such as Standard, Enterprise, and so on |
| .services.software.components | Represents individual components within the software, allowing for detailed specification of submodules that the primary software relies on |
| .services.software.components.life_cycle | Lifecycle details of a software component, such as release and end-of-life dates, providing insight into support and maintenance periods |
| .services.software.components.life_cycle.end_of_life_date | Date on which support for the software component officially ended |
| .services.software.components.life_cycle.release_date | Initial release date of the software component |
| .services.software.components.life_cycle.end_of_life | Indicates whether the software component has reached its end-of-life status (true if support is discontinued) |
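
To show how these fields might be used once a record has been retrieved, the sketch below walks a host record and lists software components flagged as end-of-life. The nesting (a list of services, each with a list of software entries, each with a list of components) is an assumption inferred from the field paths above, and the sample record is invented for the example.

```python
from typing import Any, Iterator


def end_of_life_components(host: dict[str, Any]) -> Iterator[tuple[Any, str, str]]:
    """Yield (port, product, version) for components flagged end_of_life."""
    for service in host.get("services", []):
        for software in service.get("software", []):
            for component in software.get("components", []):
                life_cycle = component.get("life_cycle") or {}
                if life_cycle.get("end_of_life") is True:
                    yield (
                        service.get("port"),
                        component.get("product", ""),
                        component.get("version", ""),
                    )


# Hypothetical record fragment, invented for this example.
sample_host = {
    "services": [
        {
            "port": 443,
            "software": [
                {
                    "components": [
                        {"product": "example-httpd", "version": "1.0",
                         "life_cycle": {"end_of_life": True}},
                        {"product": "example-lib", "version": "2.3",
                         "life_cycle": {"end_of_life": False}},
                    ]
                }
            ],
        }
    ]
}

print(list(end_of_life_components(sample_host)))
```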

Threats

In the Platform data model, Censys defines “threat” using NIST’s primary definition of “cyber threat”:

Any circumstance or event with the potential to adversely impact organizational operations (including mission, functions, image, or reputation), organizational assets, or individuals through an information system via unauthorized access, destruction, disclosure, modification of information, and/or denial of service. Also, the potential for a threat-source to successfully exploit a particular information system vulnerability.

**Threats are limited in scope as they are new objects and will be made available starting in 2025.** The fields in the threat object provide a comprehensive view of threat information detected on services on hosts and web properties. Each field contributes to the detailed profiling and validation of potential threats, leveraging both Censys proprietary methods and industry-standard practices.

| Field | Description |
| --- | --- |
| .services.threats.confidence | Shows Censys level of confidence in the threat detection, with values indicating how likely it is that the identified threat is accurate |
| .services.threats.evidence.negative | Logs evidence suggesting a threat might not be present, helping to minimize false positives |
| .services.threats.evidence.proprietary | Shows evidence obtained from Censys proprietary methods, adding further context to the detection |
| .services.threats.evidence.regex | Pattern-matching expressions used to detect signs of threats through specific sequences in data |
| .services.threats.evidence.semver_expression | Verifies vulnerabilities using software versioning rules based on known patterns |
| .services.threats.evidence.data_path | Pinpoints the location within the data structure where threat evidence was detected |
| .services.threats.evidence.exists | Indicates whether key evidence supporting the presence of a threat was found |
| .services.threats.evidence.found_value | Displays the exact value supporting the threat detection |
| .services.threats.evidence.literal_match | Confirms an exact match for a known threat indicator, ensuring precise identification |
| .services.threats.names | Provides a list of all known names and aliases for the detected threat, ensuring users can recognize it regardless of naming variations |
| .services.threats.source | Identifies the origin of the threat information, whether from external intelligence sources or Censys analysis |
| .services.threats.tactic | Categorizes the threat according to its objective (e.g., reconnaissance, lateral movement) based on standardized threat behavior models like MITRE ATT&CK |
| .services.threats.type | Categorizes the threat type (e.g., c2-server, phishing-server) to help identify the nature of the threat |
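
One possible way to work with these fields is to group the threats found on a record by their type. The sketch below assumes each service carries a list of threat objects with a single `type` string and a `names` list, as the field descriptions above suggest. The function name and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Any


def threats_by_type(record: dict[str, Any]) -> dict[str, list[tuple[Any, list[str]]]]:
    """Group (port, names) pairs by threat type across all services."""
    grouped = defaultdict(list)
    for service in record.get("services", []):
        for threat in service.get("threats", []):
            grouped[threat.get("type", "unknown")].append(
                (service.get("port"), list(threat.get("names", [])))
            )
    return dict(grouped)


# Hypothetical record fragment, invented for this example.
sample_record = {
    "services": [
        {"port": 443, "threats": [{"type": "c2-server", "names": ["ExampleThreat"]}]},
        {"port": 8080, "threats": []},
    ]
}

print(threats_by_type(sample_record))
```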

Changes to labels between Legacy Search and the Platform

In the Legacy Search dataset, labels are used for multiple purposes, ranging from indicating software manufacturers to describing records using descriptors like “network.device” or “login-page.”

There are fewer label values in the Platform than in Legacy Search. This is partially because Legacy Search “labels” that were actually unstructured software, hardware, or operating system data (e.g., jQuery, bootstrap) have been moved to the appropriate component fields in the Platform.

The table below lists a few of the labels available in Platform:

| Label Name | Description |
| --- | --- |
| IPV6 | Entity identified as an IPv6 host |
| login-page | Entity has an HTTP service that appears to host a login page |
| open-dir | Web Server with an exposed directory listing |
| suspicious-open-dir | Web Server with Suspicious Open Directory |

Platform and Legacy Search data model examples

IP host data model example

The following is an example of a host record in the Platform data model. A host represents strictly host-level information about an IP address. This includes its location, routing, and IP whois enrichments. All context included about a host is derived from service scan data.

{
    "dns": {
        // DNS Names and Forward DNS Data
    },

    "ip": <ipv4 or ipv6 address>,
    "service_count": <count of services running on host>,
    "truncated": <true or false>,

    "location": {
        // Location details such as continent, country, country_code, city, postal_code, timezone, and coordinates
    },

    "routing": {
        // Routing information such as ASN, BGP prefix, BGP name, and country
    },

    "services": [
        {
            "port": <port #>,
            "protocol": <protocol name>,
            "transport_protocol": <transport protocol: TCP, UDP, or QUIC>,
            "misconfigs": [],
            "exposures": [],
            "vulns": [],
            "software": [],
            "hardware": [],
            "operating_systems": [],
            "threats": [],
            "labels": [],
            "ip": <ipv4 or ipv6 address>,
            "scan_time": "2024-10-20T20:07:50.000Z",
            "banner": "",
            "banner_hash_sha256": "",
            "<service specific details>": {
                // Details about the service, repeated for each identified service
            }
        }
    ],

    "whois": {
        // Whois data about the host, including name, CIDRs, organization, and contacts
    },

    "labels": []
}

Search 2.0 host data model example

The Search 2.0 data model provides cohesive records about individual IPv4 hosts, decoupled ports, and protocol data. The 2021 model presents top-level information about hosts (IP address, location, routing information) and then an array of statically defined services.

In the Search 2.0 data model, Censys combined all information about a specific protocol in a single record.

For example, in Search 2.0, the host “8.8.8.8” is presented as follows:

{
    "ip": "8.8.8.8",
    "services": [
        {
            "port": 80,
            "service": "http",
            "http": {
                "title": "Hello World!"
            }
        }
    ],
    "location": {
        "city": "Ann Arbor",
        ...
    },
    ...
}

If, for example, this had been a MySQL service, there would be a “mysql” subrecord instead of an “http” one.
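
Because the subrecord is keyed by the protocol name, code that reads Search 2.0 records can look it up dynamically. The sketch below uses the key names from the example above. The `protocol_details` helper is hypothetical, the lower-casing of the name is an assumption to tolerate differently cased values, and the MySQL record is invented for illustration.

```python
from typing import Any


def protocol_details(service: dict[str, Any]) -> dict[str, Any]:
    """Return the subrecord named after the service's protocol, or {}."""
    # Assumption: the subrecord key is the lower-cased protocol name.
    name = str(service.get("service", "")).lower()
    details = service.get(name, {})
    return details if isinstance(details, dict) else {}


http_service = {"port": 80, "service": "http", "http": {"title": "Hello World!"}}
# Hypothetical MySQL service; the subrecord contents are made up.
mysql_service = {"port": 3306, "service": "mysql", "mysql": {"example_key": "example"}}

print(protocol_details(http_service))
print(protocol_details(mysql_service))
```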

Platform web property data model example

The following is an example of a web property presented in the Platform data model.

{
    "hostname": <host name>,
    "port": <port number>,

    "endpoints": [ // Repeat section for each endpoint on the asset
        {
            "hostname": <host name>,
            "port": <port number>,
            "path": <path name>,
            "endpoint_type": "HTTP",
            "transport_protocol": "TCP",
            "scan_time": "2024-10-12T12:22:18.167Z",
            "banner": <banner details>,
            "banner_hash_sha256": <banner sha256 hash>,
            "http": {
                "supports_http2": false,
                "uri": <endpoint URI>,
                "protocol": "HTTP/1.1",
                "status_code": <status code of endpoint>,
                "status_reason": <status reason>,
                "headers": {
                    // Location and location headers
                    // Content length and headers
                    // Server headers
                    // Date headers
                    // Cache-control headers
                },
                "html_tags": [
                    // List of HTML tags
                ],
                "body_size": <size of body>,
                "body": <body information>,
                "favicons": [
                    // List of favicons
                    // Includes size, name, hash_sha256, hash_md5
                ],
                "body_hash_sha256": <body sha256 hash>,
                "body_hash_sha1": <body sha1 hash>
            }
        }
    ],
    "exposures": [],
    "hardware": [],
    "labels": [],
    "misconfigs": [],
    "operating_systems": [],
    "scan_time": "2024-10-12T12:23:13.926Z",
    "software": [],
    "threats": [],
    "vulns": []
}
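
To illustrate how the endpoints array might be consumed, the sketch below maps each endpoint path of a web property to its HTTP status code. The key names follow the example record above. The `endpoint_statuses` helper and the sample values (paths and status codes) are invented for this example.

```python
from typing import Any, Optional


def endpoint_statuses(web_property: dict[str, Any]) -> dict[str, Optional[int]]:
    """Map each endpoint path to its HTTP status code (None if absent)."""
    statuses = {}
    for endpoint in web_property.get("endpoints", []):
        http = endpoint.get("http") or {}
        statuses[endpoint.get("path", "/")] = http.get("status_code")
    return statuses


# Hypothetical web property fragment, invented for this example.
sample_web_property = {
    "hostname": "app.censys.io",
    "port": 443,
    "endpoints": [
        {"path": "/", "endpoint_type": "HTTP", "http": {"status_code": 200}},
        {"path": "/login", "endpoint_type": "HTTP", "http": {"status_code": 302}},
    ],
}

print(endpoint_statuses(sample_web_property))
```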