IP Address Flat File Format Specification

IPQS IP Database Format Specifications

This document is intended for anyone who wishes to build a reader for our proxy detection database in a currently unsupported language or who wishes to better understand how our data is structured. If you would like us to support your language, feel free to contact us. Our flat file database is broken down into four sections, in order: headers, tree, records, and record data. Each section has a specific purpose within the format.


A Note From The Author

This format was designed so that it could easily be updated or extended to meet a wide variety of needs while dramatically shrinking the size of the data on disk. Readers written against this format should be flexible enough to allow for previously undefined columns, new bitmasks, or changes to existing bitmasks.


Preliminary Knowledge
Bitmasks

Our database makes frequent use of bitmasks. A bitmask is a byte in which each individual bit represents a true or a false. Thus each byte can contain 8 "binary options". As an example 00000001 would be 7 "false" values and 1 "true" value.
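
As a minimal sketch, a single bit of a bitmask can be tested with a shift and a mask. The helper name bit_is_set is illustrative only, and numbering bit 0 as the least significant (rightmost) bit is an assumption, since the text above does not say which end of the byte bit 0 refers to.

```python
def bit_is_set(mask: int, bit: int) -> bool:
    """Return True if the given bit of a one-byte bitmask is 1.

    Assumption: bit 0 is the least significant (rightmost) bit.
    """
    if not 0 <= mask <= 0xFF:
        raise ValueError("mask must fit in one byte")
    if not 0 <= bit <= 7:
        raise ValueError("bit must be between 0 and 7")
    return ((mask >> bit) & 1) == 1


# The example byte from the text: 7 "false" values and 1 "true" value.
example = 0b00000001
flags = [bit_is_set(example, b) for b in range(8)]
```

If a real file turns out to number bits from the other end, only the shift inside bit_is_set would need to change.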

Records

The documentation will frequently reference "records". Records represent a blob of data about an IP address or multiple IP addresses. Each record will consist of 1 or 3 bytes of bitmasks and a variable number of columns as defined in the header section.

Columns

The documentation will frequently reference "columns". Columns are defined in the header section and a variable number of these exist in each record. Each column represents one piece of data in the file such as Country or ISP. The order, number and defined data type of columns defined in the header section exactly matches the order, number and data type found in each record. If the header defines a column that column will appear in the same spot on every record and will always be the same data type (string, int, etc.).

Pointers

The documentation will frequently reference "pointers". A pointer in the context of the database is an unsigned 32 bit integer that tells the reader to jump to another section in the file. The number corresponds to the number of bytes from the start of the file to the point the reader should jump to.
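
A pointer read might look like the sketch below. The function name read_pointer and the toy buffer are hypothetical. The byte order is also an assumption (little endian here), because this document does not state how multi-byte integers are ordered; verify it against a real file before relying on it.

```python
BYTE_ORDER = "little"  # assumption: the text does not state the byte order


def read_pointer(data: bytes, offset: int) -> int:
    """Read an unsigned 32 bit pointer stored at `offset`."""
    raw = data[offset:offset + 4]
    if len(raw) != 4:
        raise ValueError("pointer extends past the end of the data")
    return int.from_bytes(raw, BYTE_ORDER, signed=False)


# Toy buffer: a pointer at offset 0 that refers to byte 6 of the same buffer.
blob = (6).to_bytes(4, BYTE_ORDER) + b"\x00\x00" + b"X"
target = read_pointer(blob, 0)
```

Because the pointer counts bytes from the start of the file, the returned value can be used directly as an index or seek position.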

Enum

The documentation will reference "enums". An enum is a numerical map where a number represents a string or piece of data. E.g.: 1 = IP, 2 = Email, etc.


Headers

The header section of the file describes the data found in the file and aids the reader in knowing where and how to look for records in the file. The header section of the database file consists of a few different components. The first 11 bytes of every DB file contain data required for the reader to function:

Byte 0 Bitmask designating the Block Type of this file (IPv4/IPv6) and whether the individual records start with 1 or 3 bytes of bitmasks.
Bit # Description
0 If true the file is an IPv4 file.
1 If true the file is an IPv6 file.
2 - 6 Used for other Block Types.
7 If true each record will start with 3 bytes of bitmasks. If false each record will start with 1 byte of bitmasks. (See record definition below.)
Byte 1 Unsigned one byte integer representing the format version this file was intended for.
The current file version is 1. Readers should refuse to read files that are not made for the version they were written for. While we will attempt to maintain backward compatibility going forward, there is no guarantee that any version of the format will be compatible with any other version.
Byte 2 - 4 Unsigned three byte integer containing the total number of bytes in the header section. (Header Size)
These bytes, when converted to an integer, are used to tell the reader how many bytes to read to get the full header section. This is almost always 11 + (24 * number of columns).
Byte 5 - 6 Unsigned two byte integer containing the total number of bytes in each record. (Record Bytes)
These bytes are used to tell the reader how many bytes to read to get a full record.
Byte 7 - 10 Unsigned four byte integer containing the total number of bytes in the entire file.
This is used to tell your reader if the next position read will be outside the bounds of the file.
Byte 11+ Column descriptions and column data type pairs.
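
The 11 fixed header bytes above could be parsed roughly as follows. This is a sketch, not a reference implementation: the FixedHeader class and its field names are hypothetical, the two-byte field at bytes 5 - 6 is simply kept as a raw integer, and both the little endian byte order and the "bit 0 is the least significant bit" numbering are assumptions that this document does not confirm.

```python
from dataclasses import dataclass

BYTE_ORDER = "little"   # assumption: byte order is not stated in the text
SUPPORTED_VERSION = 1   # the current file version per the text


@dataclass
class FixedHeader:
    is_ipv4: bool
    is_ipv6: bool
    three_byte_masks: bool
    version: int
    header_size: int
    bytes_5_6: int      # raw value of the two-byte field at bytes 5 - 6
    total_bytes: int


def parse_fixed_header(raw: bytes) -> FixedHeader:
    """Parse the first 11 bytes of a database file."""
    if len(raw) < 11:
        raise ValueError("need at least 11 header bytes")
    block_type = raw[0]
    version = raw[1]
    if version != SUPPORTED_VERSION:
        # Readers should refuse files written for another format version.
        raise ValueError(f"unsupported format version {version}")
    return FixedHeader(
        # Assumption: bit 0 is the least significant bit of the byte.
        is_ipv4=bool(block_type & 0b00000001),           # bit 0
        is_ipv6=bool(block_type & 0b00000010),           # bit 1
        three_byte_masks=bool(block_type & 0b10000000),  # bit 7
        version=version,
        header_size=int.from_bytes(raw[2:5], BYTE_ORDER),
        bytes_5_6=int.from_bytes(raw[5:7], BYTE_ORDER),
        total_bytes=int.from_bytes(raw[7:11], BYTE_ORDER),
    )
```

A reader would typically call this once when the file is opened and keep the result in memory.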

The remainder of the header section contains a variable number of Column Pairs. Each Column Pair is 24 bytes in length. You can calculate the total number of column pairs by taking the Header Size from bytes 2 - 4, subtracting 11 (the number of guaranteed header bytes), then dividing by 24. (AKA: (Header Size - 11) / 24) As an example, if the Header Size is 131, there would be 5 columns for every record in the file.

Each column pair is broken down as follows:

Byte # Description
0 - 22 The ASCII description of the column, up to 23 characters. Unused bytes are right padded with 0's.
23 The data type bitmask for this column (string, int, etc...). See Data Type bitmasks below.
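
The column pair layout above can be sketched as a small parser. The function name parse_column_pairs is hypothetical, and treating the "right padded 0's" as trailing NUL (0x00) bytes is an assumption; if a real file pads with a different byte, the rstrip call would need adjusting.

```python
FIXED_HEADER_BYTES = 11   # guaranteed header bytes
COLUMN_PAIR_BYTES = 24    # 23 bytes of description + 1 data type byte


def parse_column_pairs(header: bytes, header_size: int):
    """Return a list of (name, data_type_bitmask) tuples from the header."""
    if len(header) < header_size:
        raise ValueError("header buffer is shorter than the Header Size")
    count = (header_size - FIXED_HEADER_BYTES) // COLUMN_PAIR_BYTES
    columns = []
    for index in range(count):
        start = FIXED_HEADER_BYTES + index * COLUMN_PAIR_BYTES
        pair = header[start:start + COLUMN_PAIR_BYTES]
        # Assumption: padding consists of NUL (0x00) bytes.
        name = pair[:23].rstrip(b"\x00").decode("ascii")
        columns.append((name, pair[23]))
    return columns
```

The order of the returned list matters, since each record stores its columns in exactly this order.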


Tree

The tree starts immediately after the last header byte. For performance reasons it is recommended to cache the Header Size and Column Pairs in memory so the header does not have to be read on each lookup. This way your reader can jump straight to the first tree byte to start a lookup.

The tree is a binary tree. Your code will make a series of left or right decisions to locate the corresponding record for a given IP. The tree consists of 5 bytes of header data followed by a series of 8 byte branches. The tree's format is detailed below:

Byte 0 Bitmask designating that this block type is a tree.

This byte is mostly just used for formatting reasons and can typically be ignored unless you're writing a database validator.

Bit # Description
0 - 1 Used for other Block Types
2 If true this is a tree block.
3 - 7 Used for other Block Types.
Byte 1 - 4 Unsigned four byte integer containing the total number of bytes in the tree section. (Tree Bytes)
Used for calculating if the next pointer is in the tree section.
Byte 5 - 12 First tree pointer pair.

The tree section from this point onward can be broken up into 8 byte segments or "nodes". Each node consists of two 4 byte unsigned integers, and each integer is a pointer to another position in the file. If the pointer is 0 or is higher than the total number of bytes in the file, the IP is invalid and has no record. Otherwise, if the pointer is less than the end of the tree section (Tree Start + Tree Bytes, where Tree Start is the position of the first tree byte, directly after the header), the next position is another binary tree node. If the pointer is past the end of the tree section but less than the total bytes in the file, the next position is a Record.


You can traverse this binary tree by converting any IP address into its big endian binary format. Iterate over each bit in the IP from left to right. If the bit equals 0 follow the LEFT pointer (first 4 bytes) and move to the next bit. If the bit equals 1 follow the RIGHT pointer (last 4 bytes) and move to the next bit. Remember each pointer is the total number of bytes from the start of the file to the next read position.

Node (Bytes 5 - 12)
Byte 5 - 8 Unsigned 32 bit integer. Followed when IP bit 0 is 0.
Byte 9 - 12 Unsigned 32 bit integer. Followed when IP bit 0 is 1.

As an example, the sample file contains the IP address 8.8.0.0, which is 00001000000010000000000000000000 in binary. Thus when traversing the tree the first 4 decisions will use the LEFT pointer (the first 4 bytes) of each node, then the 5th decision will use the RIGHT pointer (the last 4 bytes) of its node. This process is repeated until your next pointer equals zero or exceeds the end of the tree section (Tree Start + Tree Bytes).
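
The traversal described above can be sketched as follows, operating on the whole file held in memory as a bytes object. The function names ip_to_bits and find_record_offset are hypothetical, and the little endian byte order for pointers is an assumption that this document does not confirm. The values tree_start, tree_bytes and total_bytes are expected to come from the header and tree header as described earlier.

```python
import ipaddress

BYTE_ORDER = "little"   # assumption: byte order is not stated in the text
TREE_HEADER_BYTES = 5   # 1 block type byte + 4 Tree Bytes bytes


def ip_to_bits(ip: str):
    """Yield the bits of an IP address from left (most significant) to right."""
    addr = ipaddress.ip_address(ip)
    value = int(addr)
    for shift in range(addr.max_prefixlen - 1, -1, -1):
        yield (value >> shift) & 1


def find_record_offset(data: bytes, tree_start: int, tree_bytes: int,
                       total_bytes: int, ip: str):
    """Walk the tree and return the record's byte offset, or None."""
    tree_end = tree_start + tree_bytes
    position = tree_start + TREE_HEADER_BYTES  # first node
    for bit in ip_to_bits(ip):
        # LEFT pointer is the first 4 bytes, RIGHT pointer the last 4 bytes.
        slot = position + (4 if bit else 0)
        pointer = int.from_bytes(data[slot:slot + 4], BYTE_ORDER)
        if pointer == 0 or pointer >= total_bytes:
            return None  # invalid IP: no record for this address
        if pointer < tree_end:
            position = pointer  # another node, keep walking
            continue
        return pointer  # past the tree section: this is a record
    return None
```

For 8.8.0.0 the first five bits yielded by ip_to_bits are 0, 0, 0, 0, 1, which corresponds to the four LEFT decisions followed by one RIGHT decision in the example above.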



Record

Each record starts with 1 or 3 bytes of bitmasks for various options. If Block Type bit 7 is true in the file's headers then the record will start with 3 bytes of bitmasks as detailed below:

Byte 0 Bitmask

This byte contains data about the usage of this IP address.

Bit # Description.
0 True if this IP is likely to be a proxy.
1 True if this IP is likely to be a VPN.
2 True if this IP is likely to be a TOR node.
3 True if this IP is a known search engine crawler.
4 True if this IP is likely to be a bot.
5 True if this IP has recently been seen committing abusive actions.
6 True if this IP is on a blacklist or has been on a blacklist recently.
7 True if this IP is a private non-routable IP address.
Byte 1 Bitmask

This byte contains data about the usage of this IP address.

Bit # Description.
0 True if this IP is likely to belong to a mobile carrier.
1 True if this IP has or recently had open (listening) ports.
2 True if this IP is or was recently used by a hosting provider.
3 True if this IP is likely to be an active VPN provider.
4 True if this IP is likely to be an active TOR node.
5 True if this IP is likely to be a public access point (coffee shop, library, campus, etc.).
6 This bit is reserved for future use or custom applications.
7 This bit is reserved for future use or custom applications.
Byte 2 Bitmask

This byte contains data about the type of IP address and its recent abuse.

Bit # Description.
0 This bit is reserved for future use or custom applications.
1 This bit is reserved for future use or custom applications.
2 This bit is reserved for future use or custom applications.
3 - 5 Three bit unsigned integer enum representing the connection type of this IP address.
    # Enum Description
    1 Residential IP
    2 Mobile IP
    3 Corporate IP
    4 Data Center IP
    5 Educational IP
6 - 7 Two bit unsigned integer enum representing the recent abuse velocity of this IP address.
    # Enum Description
    1 Low Recent Abuse IP
    2 Medium Recent Abuse IP
    3 High Recent Abuse IP

If Block Type bit 7 is false in the file's headers then the first and only byte of bitmasks will be as follows:

Byte 0 Bitmask

This byte contains data about the type of IP address and its recent abuse.

Bit # Description.
0 This bit is reserved for future use or custom applications.
1 This bit is reserved for future use or custom applications.
2 This bit is reserved for future use or custom applications.
3 - 5 Three bit unsigned integer enum representing the connection type of this IP address.
    # Enum Description
    1 Residential IP
    2 Mobile IP
    3 Corporate IP
    4 Data Center IP
    5 Educational IP
6 - 7 Two bit unsigned integer enum representing the recent abuse velocity of this IP address.
    # Enum Description
    1 Low Recent Abuse IP
    2 Medium Recent Abuse IP
    3 High Recent Abuse IP
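
Decoding the 1 or 3 bitmask bytes at the start of a record might look like the sketch below. All names here (parse_record_masks, the flag names, the lookup dictionaries) are hypothetical. The sketch assumes bit 0 is the least significant bit and that, inside the multi-bit enum fields, the lower numbered bit is the less significant one; neither is stated in this document, so check the results against known records.

```python
# Hypothetical flag names for bits 0 - 7 of byte 0 and bits 0 - 5 of byte 1.
BYTE0_FLAGS = ("is_proxy", "is_vpn", "is_tor", "is_crawler",
               "is_bot", "recent_abuse", "is_blacklisted", "is_private")
BYTE1_FLAGS = ("is_mobile", "has_open_ports", "is_hosting_provider",
               "active_vpn", "active_tor", "public_access_point")

CONNECTION_TYPES = {1: "Residential", 2: "Mobile", 3: "Corporate",
                    4: "Data Center", 5: "Educational"}
ABUSE_VELOCITY = {1: "low", 2: "medium", 3: "high"}


def parse_record_masks(data: bytes, offset: int, three_byte_masks: bool):
    """Decode the bitmask bytes at the start of a record.

    Returns (flags, connection_type, abuse_velocity, next_offset), where
    next_offset is the position of the first column after the bitmasks.
    """
    flags = {}
    if three_byte_masks:
        first, second = data[offset], data[offset + 1]
        for bit, name in enumerate(BYTE0_FLAGS):
            flags[name] = bool((first >> bit) & 1)
        for bit, name in enumerate(BYTE1_FLAGS):
            flags[name] = bool((second >> bit) & 1)
        offset += 2
    last = data[offset]
    connection_type = (last >> 3) & 0b111  # bits 3 - 5
    abuse_velocity = (last >> 6) & 0b11    # bits 6 - 7
    return flags, connection_type, abuse_velocity, offset + 1
```

The enum values are returned as raw integers because the tables above do not define a meaning for 0; a caller can map them with CONNECTION_TYPES.get(value) and ABUSE_VELOCITY.get(value).
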

Record Columns / Data Types

After the bitmasks a variable number of Columns based on the header section will be found. The order of these columns will exactly match the header. Use the Data Type value from the column's header to determine the length of each set of data as described below:

# Bitmask
0 - 2 Used for other Block Types or reserved for future use.
3 This column is a string pointer. This will be represented on the record as an unsigned 4 byte integer. Follow the pointer to a piece of string data. See String Data below.
4 This column is a small integer. Small integers are one byte unsigned integers stored directly on the record.
5 This column is an integer. Integers are four byte unsigned integers stored directly on the record.
6 This column is a float. Floats are four byte IEEE 754 single precision floating point values stored directly on the record.
7 Used for other Block Types or reserved for future use.
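
Reading one column according to its data type bitmask could be sketched as follows. The function name read_column and the constant names are hypothetical. The mask values assume bit 0 is the least significant bit (so bit 3 is 8, bit 4 is 16, and so on), and the little endian byte order is likewise an assumption not confirmed by this document.

```python
import struct

BYTE_ORDER = "little"    # assumption: byte order is not stated in the text

# Assumption: bit 0 is the least significant bit of the data type byte.
STRING_POINTER = 1 << 3  # bit 3
SMALL_INT = 1 << 4       # bit 4
INTEGER = 1 << 5         # bit 5
FLOAT = 1 << 6           # bit 6


def read_column(data: bytes, offset: int, data_type: int):
    """Decode one column at `offset`; return (value, next_offset).

    For string columns the returned value is the raw pointer, which still
    has to be followed to the string data.
    """
    if data_type & STRING_POINTER:
        return int.from_bytes(data[offset:offset + 4], BYTE_ORDER), offset + 4
    if data_type & SMALL_INT:
        return data[offset], offset + 1
    if data_type & INTEGER:
        return int.from_bytes(data[offset:offset + 4], BYTE_ORDER), offset + 4
    if data_type & FLOAT:
        fmt = "<f" if BYTE_ORDER == "little" else ">f"
        return struct.unpack_from(fmt, data, offset)[0], offset + 4
    raise ValueError(f"unknown data type bitmask {data_type:#010b}")
```

Calling read_column once per header column, feeding each next_offset into the following call, walks a record from its first column to its last.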

Record Data

At the moment there is only one type of record data not directly stored on the record: String Data. Changes to the format in the future would potentially allow for up to 4 other kinds of data to be stored in this section of the file.


String Data

The first byte of each String is an unsigned 8-bit integer. This integer specifies the length of the String in question (n). The following n bytes are the string itself.
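
A length-prefixed string of this kind can be read with a few lines of code. The function name read_string is hypothetical, and decoding as UTF-8 is an assumption, since this document does not state the character encoding of string data.

```python
def read_string(data: bytes, pointer: int, encoding: str = "utf-8") -> str:
    """Read a length-prefixed string located at `pointer`.

    Assumption: the text encoding is UTF-8 (not stated in the format text).
    """
    length = data[pointer]          # first byte: length n of the string
    start = pointer + 1
    raw = data[start:start + length]
    if len(raw) != length:
        raise ValueError("string runs past the end of the data")
    return raw.decode(encoding)
```

The pointer argument is the value stored in a string pointer column, counted in bytes from the start of the file.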


Supported Columns

Most reader implementations should support a default set of columns and expected value types. For reference they're listed below:

Name Type Description
Country String Data Two character country code of IP address or "N/A" if unknown. Technically this could be stored on the record for a two byte savings, however it's possible users may want a full string in the future so this has been left out.
City String Data City of IP address if available or "N/A" if unknown.
Region String Data Region (state) of IP address if available or "N/A" if unknown.
ISP String Data ISP if one is known. Otherwise "N/A".
Organization String Data Organization if one is known. Can be parent company or sub company of the listed ISP. Otherwise "N/A".
Timezone String Data Timezone of IP address if available or "N/A" if unknown.
ASN Integer Data Autonomous System Number if one is known. Null if nonexistent. Stored on the record.
Zero Fraud Score Integer Data The "strictness" = 0 fraud score for this IP address. (See proxy documentation for details about strictness.) Stored on the record.
One Fraud Score Integer Data The "strictness" = 1 fraud score for this IP address. (See proxy documentation for details about strictness.) Stored on the record.
Two Fraud Score Integer Data The "strictness" = 2 fraud score for this IP address. (See proxy documentation for details about strictness.) Stored on the record.
Latitude Float Data Latitude of IP address if available or "N/A" if unknown. Stored on the record.
Longitude Float Data Longitude of IP address if available or "N/A" if unknown. Stored on the record.

Sample File

For testing purposes a sample file can be found here. The file is gzipped; decompress it before use. The sample file contains data for the 8.8.0.0 IP address.