Database Format

IPQS Database Format Specifications

This document is intended for anyone who wishes to build a reader for our proxy detection database in a currently unsupported language or to understand better how our data is structured. If you would like us to support your language, contact us. The flat file database is divided into four sections: headers, tree, records, and record data. Each section has a specific purpose within the format.

We designed the database format so we could easily update or extend it to meet a wide variety of needs while shrinking the file size of the data on disk. If you are building your own reader against this format, you should design the reader to be flexible in allowing for previously undefined columns, new bitmasks, or changes to existing bitmasks.

Glossary

Terminology	Description
Bitmasks	Our database frequently uses bitmasks. A bitmask is a byte in which each individual bit represents a true or a false. Thus, each byte can contain 8 "binary options." For example, 00000001 would contain 7 "false" values and 1 "true" value.
Records	This documentation will frequently reference "records". Records represent a blob of data about an IP address or multiple IP addresses. Each record will consist of 1 to 3 bytes of bitmasks and a variable number of columns, as defined in the header section.
Columns	The documentation will frequently reference "columns". Columns are defined in the header section. A variable number of these exist in each record. Each column represents one piece of data in the file, such as Country or ISP. The order, number, and defined data type of columns defined in the header section exactly match the order, number, and data type found in each record. If the header defines a column, that column will appear in the same spot on every record and always be the same data type (string, int, etc.).
Pointers	The documentation will frequently reference "pointers". A pointer in the context of the database is an unsigned 32-bit integer that tells the reader to jump to another section in the file. The number corresponds to the number of bytes from the start of the file to the point the reader should jump to.
Enum	The documentation will reference "enums". An enum is a numerical map in which a number represents a string or piece of data (e.g., 1 = IP, 2 = Email, etc.).

Headers

The header section of the file describes the data found in the file and aids the reader in knowing where and how to look for records in the file. The header section of the database file consists of a few different components. The first 11 bytes of every DB file contain data required for the reader to function:

Byte Numbers

Data Type

Description

Byte 0

Bitmask designating the Block Type of this file (IPv4/IPv6) and if the individual records start with 1 or 3 bytes of bitmasks.

Bit Number	Description
0	If true the file is an IPv4 file.
1	If true the file is an IPv6 file.
2	If true the file is a blocklist file.
3 - 6	Reserved for future use.
7	If true each record will start with 3 bytes of bitmasks. If false each record will start with 1 byte of bitmasks. (See record definition below.)

Byte 1

Unsigned one byte integer representing the format version this file was intended for.

The current file version is 1. Readers should refuse to read files that were not made for the version they were written for. We will try to maintain backward compatibility going forward, but there is no guarantee that any version of the format will be compatible with any other version.

Bytes 2-4

Unsigned three byte integer containing the total number of bytes in the header section. (Header Size)

These bytes, when converted to an integer, are used to tell the reader how many bytes to read to get the full header section. This is almost always 11 + (24 * number of columns).

Bytes 5-6

Unsigned two byte integer containing the total number of bytes in each record.

These bytes are used to tell the reader how many bytes to read to get the full record length. Each record is identically sized.

Bytes 7-10

Unsigned four byte integer containing the total number of bytes in the entire file.

This is used to tell your reader if the next position read will be outside the bounds of the file.
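The fixed 11-byte layout above could be parsed roughly as follows. This is a minimal sketch, not a reference implementation: the names `FixedHeader`, `bit_is_set`, and `parse_fixed_header` are hypothetical, and two things this document does not state are assumed here, namely that multi-byte integers are little-endian and that bit 0 is the least significant bit of a byte. Adjust both if your files disagree.

```python
from dataclasses import dataclass

# Assumption: the byte order of multi-byte integers is not stated in this
# document; little-endian is assumed for this sketch.
BYTE_ORDER = "little"
SUPPORTED_VERSION = 1


@dataclass
class FixedHeader:
    is_ipv4: bool
    is_ipv6: bool
    is_blocklist: bool
    three_byte_flags: bool
    version: int
    header_size: int
    record_size: int
    file_size: int


def bit_is_set(byte: int, bit: int) -> bool:
    # Assumption: bit 0 is the least significant bit.
    return bool((byte >> bit) & 1)


def parse_fixed_header(raw: bytes) -> FixedHeader:
    """Decode bytes 0-10 of the file into their documented fields."""
    if len(raw) < 11:
        raise ValueError("need at least 11 header bytes")
    version = raw[1]
    if version != SUPPORTED_VERSION:
        # The format asks readers to refuse versions they were not written for.
        raise ValueError("unsupported format version: %d" % version)
    block_type = raw[0]
    return FixedHeader(
        is_ipv4=bit_is_set(block_type, 0),
        is_ipv6=bit_is_set(block_type, 1),
        is_blocklist=bit_is_set(block_type, 2),
        three_byte_flags=bit_is_set(block_type, 7),
        version=version,
        header_size=int.from_bytes(raw[2:5], BYTE_ORDER),
        record_size=int.from_bytes(raw[5:7], BYTE_ORDER),
        file_size=int.from_bytes(raw[7:11], BYTE_ORDER),
    )
```

A reader would typically call `parse_fixed_header` once on the first 11 bytes and keep the result in memory for later lookups.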

Bytes 11+

Column descriptions and column data type pairs.

The remainder of the header section contains a variable number of Column Pairs. Each Column Pair is 24 bytes in length. You can calculate the total number of column pairs by taking the Header Size from bytes 2-4, subtracting 11 (the number of guaranteed header bytes), and dividing by 24, i.e. (Header Size - 11) / 24. As an example, if the Header Size is 131, there would be 5 columns for every record in the file.

Each column pair is broken down as follows:

Bytes 0 to 22: The ASCII description of the column, up to 23 characters. Unused characters are right-padded with 0s.

Byte 23: The data type bitmask for this column (string, int, etc.). See Data Type bitmasks below.
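As a rough illustration of the column pair layout, the sketch below splits a full header section into (name, data type byte) tuples. The function names are hypothetical, and it assumes the padding "0s" are NUL bytes (0x00) rather than the ASCII character "0", which this document does not spell out.

```python
from typing import List, Tuple

FIXED_HEADER_BYTES = 11   # bytes 0-10, always present
COLUMN_PAIR_BYTES = 24    # 23 bytes of name + 1 byte of data type


def column_count(header_size: int) -> int:
    """Number of column pairs implied by the Header Size field."""
    return (header_size - FIXED_HEADER_BYTES) // COLUMN_PAIR_BYTES


def parse_columns(header: bytes) -> List[Tuple[str, int]]:
    """Split the full header section into (name, data_type_byte) pairs."""
    columns = []
    for index in range(column_count(len(header))):
        start = FIXED_HEADER_BYTES + index * COLUMN_PAIR_BYTES
        pair = header[start:start + COLUMN_PAIR_BYTES]
        # Assumption: unused name bytes are NUL (0x00) padding.
        name = pair[:23].rstrip(b"\x00").decode("ascii")
        columns.append((name, pair[23]))
    return columns
```

Keeping the data type byte unparsed lets a reader stay tolerant of bitmask values it does not yet know about.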

Tree

After the last header byte, the tree starts. For performance reasons, it is recommended to cache the Header Size and Column Pairs in memory to reduce the need to read the header on each lookup. This way, your reader can jump straight to the first tree byte to start a lookup.

The tree, a binary tree to be precise, plays a pivotal role in the lookup process. Your code will rely on a series of left or right decisions, guided by the tree, to pinpoint the corresponding record for a given IP. The tree is composed of 5 bytes of header data, followed by a series of 8-byte branches. The tree's format is detailed below:

Byte Number

Data Type

Description

Byte 0

Bitmask designating that this block type is a tree.

This byte is mostly just used for formatting reasons and can typically be ignored unless you're writing a database validator.

Bit #	Description
0-1	Reserved for future use.
2	If true this is a tree block.
3-6	Used for other Block Types.
7	Reserved for future use.

Bytes 1-4

Unsigned four byte integer containing the total number of bytes in the tree section. (Tree Bytes)

Used for calculating if the next pointer is in the tree section.

Bytes 5-12

First tree pointer pair.

The tree section from this point onward can be broken up into 8-byte segments or "nodes". Each node can be divided into two 4-byte unsigned integers. Each integer is a pointer to another position in the file. If the integer is less than the highest possible byte in the tree section (Tree Start + Tree Bytes), then the next position is another binary tree node. If the position is greater than the highest possible byte in the tree section, then the next position is either a Record (if less than the total bytes in the file) or an invalid IP (integer = 0 or higher than the total bytes in the file).

You can traverse this binary tree by converting any IP address into its big-endian binary format. Iterate over each bit in the IP from left to right. If the bit equals 0, follow the LEFT pointer (first 4 bytes) and move to the next bit. If the bit equals 1, follow the RIGHT pointer (last 4 bytes) and move to the next bit. Remember that each pointer is the total number of bytes from the start of the file to the next read position.

Bytes 5-8: Unsigned 32 Bit int. IP Bit 0 is 0.

Bytes 9-12: Unsigned 32 Bit int. IP Bit 0 is 1.

As an example, the IPv4 sample file contains the IP address 8.8.0.0, which is 00001000000010000000000000000000 in binary. Thus, when traversing the tree, the first 4 decisions will use the LEFT (or first 4 bytes) of each node, then the 5th decision will use the RIGHT (or last 4 bytes) of its node. This process is repeated until your next pointer equals zero or exceeds the total (Tree Start + Tree Bytes).
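The traversal described above might look like the sketch below, which handles exact matches only. It is an illustration under stated assumptions, not the official reader: `ip_bits` and `find_record_offset` are hypothetical names, pointers are assumed to be little-endian, Tree Bytes is assumed to include the 5 tree header bytes, and the length of the in-memory buffer stands in for the file size from the header.

```python
import ipaddress
from typing import Optional

BYTE_ORDER = "little"   # assumption: byte order is not stated in this document
TREE_HEADER_BYTES = 5   # block type byte + 4-byte Tree Bytes field


def ip_bits(ip: str) -> str:
    """Return the IP as a string of '0'/'1', most significant bit first."""
    addr = ipaddress.ip_address(ip)
    return format(int(addr), "0{}b".format(addr.max_prefixlen))


def find_record_offset(data: bytes, header_size: int, ip: str) -> Optional[int]:
    """Walk the tree bit by bit; return the record's byte offset or None."""
    tree_start = header_size
    tree_bytes = int.from_bytes(data[tree_start + 1:tree_start + 5], BYTE_ORDER)
    tree_end = tree_start + tree_bytes
    node = tree_start + TREE_HEADER_BYTES
    for bit in ip_bits(ip):
        # LEFT pointer is the first 4 bytes of the node, RIGHT the last 4.
        offset = node if bit == "0" else node + 4
        pointer = int.from_bytes(data[offset:offset + 4], BYTE_ORDER)
        if pointer == 0 or pointer >= len(data):
            return None          # the exact IP is not in the file
        if pointer >= tree_end:
            return pointer       # pointer left the tree: this is a record
        node = pointer           # still inside the tree: keep walking
    return None
```

For instance, `ip_bits("8.8.0.0")` yields the 32-character bit string quoted in the example above, so the first four iterations read the left pointer and the fifth reads the right one.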

If you reach a pointer that points to zero then the exact IP isn't in the file. If this file has been designated as a "BlocklistFile" by the first header byte you should error and abort the lookup. If the file is not a blocklist file you should move the pointer back up the tree to the nearest "1", use the zero branch and assume all forward decisions are "1"s until you either land at a record or hit another 0. For a code example of this process please see Golang Database Reader.
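One possible reading of the fallback rule above is sketched here. It is an interpretation, not the behavior of the official Golang reader: all names are hypothetical, pointers are assumed little-endian, a pointer beyond the end of the buffer is treated like a zero pointer, and the lookup gives up (returns `None`) if it hits a second zero after backing up, since the text does not say what should happen in that case.

```python
from typing import List, Optional, Tuple

BYTE_ORDER = "little"   # assumption: byte order is not stated in this document
TREE_HEADER_BYTES = 5


def read_pointer(data: bytes, node: int, bit: str) -> int:
    offset = node + (4 if bit == "1" else 0)
    return int.from_bytes(data[offset:offset + 4], BYTE_ORDER)


def lookup_with_fallback(data: bytes, tree_start: int, bits: str,
                         is_blocklist: bool) -> Optional[int]:
    """bits is the IP as a '0'/'1' string, most significant bit first."""
    tree_bytes = int.from_bytes(data[tree_start + 1:tree_start + 5], BYTE_ORDER)
    tree_end = tree_start + tree_bytes
    node = tree_start + TREE_HEADER_BYTES
    visited: List[Tuple[int, str]] = []   # (node offset, bit taken there)
    missed = False
    for bit in bits:
        visited.append((node, bit))
        pointer = read_pointer(data, node, bit)
        if pointer == 0 or pointer >= len(data):
            missed = True
            break
        if pointer >= tree_end:
            return pointer                # exact match
        node = pointer
    if not missed:
        return None                       # ran out of bits inside the tree
    if is_blocklist:
        raise LookupError("IP is not present in this blocklist file")
    # Back up to the nearest node where a "1" (RIGHT) decision was made.
    while visited:
        node, bit = visited.pop()
        if bit == "1":
            break
    else:
        return None                       # no "1" decision to back up to
    # Take the zero branch there, then assume every later decision is "1".
    pointer = read_pointer(data, node, "0")
    for _ in range(len(bits)):
        if pointer == 0 or pointer >= len(data):
            return None                   # hit another 0: give up (assumption)
        if pointer >= tree_end:
            return pointer
        node = pointer
        pointer = read_pointer(data, node, "1")
    return None
```

Recording each visited node with the bit taken there is what makes the "move back up the tree" step possible without re-reading from the root.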

Record

Each record starts with 1 or 3 bytes of bitmasks for various options.

Block Type Bit 7 is True

If Block Type bit 7 is true in the file's headers, then the record will start with 3 bytes of bitmasks as detailed below:

Byte 0: This byte contains data about the usage of this IP address:

Bit #	Description
0	True if this IP is likely to be a proxy.
1	True if this IP is likely to be a VPN.
2	True if this IP is likely to be a TOR node.
3	True if this IP is a known search engine crawler.
4	True if this IP is likely to be a bot.
5	True if this IP has recently been seen committing abusive actions.
6	True if this IP is on a blocklist or has been on a blocklist recently.
7	True if this IP is a private non-routable IP address.

Byte 1: This byte contains data about the usage of this IP address:

Bit #	Description
0	True if this IP is likely to belong to a mobile carrier.
1	True if this IP has or recently had open (listening) ports.
2	True if this IP is or was recently used by a hosting provider.
3	True if this IP is likely to be an active VPN provider.
4	True if this IP is likely to be an active TOR node.
5	True if this IP is likely to be a public access point (coffee shop, library, campus, etc.).
6	This bit is reserved for future use or custom applications.
7	This bit is reserved for future use or custom applications.

Byte 2: This byte contains data about the type of IP address and its recent abuse:

Bit #	Description
0-2	These bits are reserved for future use or custom applications.
3-5	Three-bit unsigned integer enum representing the connection type of this IP address.
6-7	Two-bit unsigned integer enum representing the recent abuse level of this IP address.

Connection type enum (bits 3-5):

#	Enum Description
1	Residential IP
2	Mobile IP
3	Corporate IP
4	Data Center IP
5	Educational IP

Recent abuse enum (bits 6-7):

#	Enum Description
1	Low Recent Abuse IP
2	Medium Recent Abuse IP
3	High Recent Abuse IP

Block Type Bit 7 is False

If Block Type bit 7 is false in the file's headers, then the first and only byte of bitmasks will be as follows:

Byte 0: This byte contains data about the type of IP address and its recent abuse.

Bit #	Description
0-2	These bits are reserved for future use or custom applications.
3-5	Three-bit unsigned integer enum representing the connection type of this IP address.
6-7	Two-bit unsigned integer enum representing the recent abuse level of this IP address.

Connection type enum (bits 3-5):

#	Enum Description
1	Residential IP
2	Mobile IP
3	Corporate IP
4	Data Center IP
5	Educational IP

Recent abuse enum (bits 6-7):

#	Enum Description
1	Low Recent Abuse IP
2	Medium Recent Abuse IP
3	High Recent Abuse IP
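Both bitmask layouts can be decoded with one helper, sketched below. The names are hypothetical, and the sketch assumes that bit 0 is the least significant bit, that the lowest-numbered bit of each enum field is its least significant bit, and that an enum value not listed in the tables (such as 0) means "Unknown". None of these three points is stated in this document.

```python
from typing import Dict, Union

FIRST_BYTE_FLAGS = ["proxy", "vpn", "tor", "crawler", "bot",
                    "recent_abuse", "blocklisted", "private"]
SECOND_BYTE_FLAGS = ["mobile", "open_ports", "hosting_provider",
                     "active_vpn", "active_tor", "public_access_point"]
CONNECTION_TYPES = {1: "Residential IP", 2: "Mobile IP", 3: "Corporate IP",
                    4: "Data Center IP", 5: "Educational IP"}
ABUSE_LEVELS = {1: "Low Recent Abuse IP", 2: "Medium Recent Abuse IP",
                3: "High Recent Abuse IP"}


def decode_flags(record: bytes,
                 three_byte_flags: bool) -> Dict[str, Union[bool, str]]:
    """Decode the 1 or 3 leading bitmask bytes of a record."""
    result: Dict[str, Union[bool, str]] = {}
    if three_byte_flags:
        for bit, name in enumerate(FIRST_BYTE_FLAGS):
            result[name] = bool((record[0] >> bit) & 1)
        for bit, name in enumerate(SECOND_BYTE_FLAGS):
            result[name] = bool((record[1] >> bit) & 1)
        type_byte = record[2]
    else:
        type_byte = record[0]
    # Bits 3-5: connection type enum. Bits 6-7: recent abuse enum.
    connection = (type_byte >> 3) & 0b111
    abuse = (type_byte >> 6) & 0b11
    result["connection_type"] = CONNECTION_TYPES.get(connection, "Unknown")
    result["recent_abuse_level"] = ABUSE_LEVELS.get(abuse, "Unknown")
    return result
```

The `three_byte_flags` argument would come from bit 7 of the file's Block Type byte.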

Record Columns & Data Types

After the bitmasks, a variable number of Columns based on the header section will be listed. The order of these columns will exactly match the header. Use the Data Type value from the column's header to determine the length of each set of data as described below:

#	Bitmask
0 - 1	Reserved for future use.
2	Reserved for tree block use.
3	This column is a string pointer. This will be represented on the record as an unsigned 4 byte integer. Follow the pointer to a piece of string data. See String Data below.
4	This column is a small integer. Small integers are one byte unsigned integers stored directly on the record.
5	This column is an integer. Integers are four-byte unsigned integers stored directly on the record.
6	This column is a float. Floats are four-byte values in the IEEE 754 single-precision binary representation, including the sign bit. Floats are stored directly on the record.
7	Reserved for future use.
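To show how the data type bits drive the width of each column, here is a minimal sketch that walks a single record. The constant and function names are hypothetical, and the sketch assumes little-endian values and that bit 0 is the least significant bit. String columns are returned as raw pointers, which the caller would still need to follow into the record data section.

```python
import struct
from typing import Dict, List, Tuple, Union

# Assumption: bit 0 is the least significant bit of the data type byte.
STRING_POINTER = 1 << 3
SMALL_INT = 1 << 4
INTEGER = 1 << 5
FLOAT = 1 << 6


def decode_columns(record: bytes, flag_bytes: int,
                   columns: List[Tuple[str, int]]) -> Dict[str, Union[int, float]]:
    """Read each column in header order, starting after the bitmask bytes."""
    values: Dict[str, Union[int, float]] = {}
    offset = flag_bytes   # 1 or 3, depending on Block Type bit 7
    for name, data_type in columns:
        if data_type & STRING_POINTER:
            # Unsigned 4-byte pointer into the record data section.
            values[name] = struct.unpack_from("<I", record, offset)[0]
            offset += 4
        elif data_type & SMALL_INT:
            values[name] = record[offset]
            offset += 1
        elif data_type & INTEGER:
            values[name] = struct.unpack_from("<I", record, offset)[0]
            offset += 4
        elif data_type & FLOAT:
            values[name] = struct.unpack_from("<f", record, offset)[0]
            offset += 4
        else:
            raise ValueError("unknown data type for column %r" % name)
    return values
```

Raising on an unknown data type is only one option; a more tolerant reader might need another way to learn the width of a column it does not recognize.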

Record Data

Currently, only one type of record data is not directly stored on the record: String Data. Future changes to the format would allow up to four other kinds of data to be stored in this section of the file.

String Data

The first byte of each string is an unsigned 8-bit integer. This integer specifies the length of the string in question (n). The following n bytes are the string itself.
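Reading a length-prefixed string is short enough to sketch directly. The function name is hypothetical, and decoding the bytes as UTF-8 is an assumption, since this document does not state the string encoding.

```python
def read_string(data: bytes, pointer: int) -> str:
    """Read the length-prefixed string that starts at byte offset `pointer`."""
    length = data[pointer]                       # unsigned 8-bit length (n)
    raw = data[pointer + 1:pointer + 1 + length]
    if len(raw) != length:
        raise ValueError("string runs past the end of the data")
    # Assumption: strings are UTF-8 (ASCII-compatible) encoded.
    return raw.decode("utf-8")
```

Because the length is a single byte, no string stored this way can exceed 255 bytes.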

Supported Columns

If you are designing your own reader implementation, you should design it to support a default set of columns and expected value types. The supported column types are listed below:

Name	Type	Description
Country	String Data	Two character country code of IP address or "N/A" if unknown. Technically this could be stored on the record for a two byte savings, however it's possible users may want a full string in the future so this has been left out.
City	String Data	City of IP address if available or "N/A" if unknown.
Region	String Data	Region (state) of IP address if available or "N/A" if unknown.
ISP	String Data	ISP if one is known. Otherwise "N/A".
Organization	String Data	Organization if one is known. Can be parent company or sub company of the listed ISP. Otherwise "N/A".
Timezone	String Data	Timezone of IP address if available or "N/A" if unknown.
ASN	Integer Data	Autonomous System Number if one is known. Null if nonexistent. Stored on the record.
Zero Fraud Score	Integer Data	The "strictness" = 0 fraud score for this IP address. (See Proxy & VPN Detection for details about strictness.) Stored on the record.
One Fraud Score	Integer Data	The "strictness" = 1 fraud score for this IP address. (See Proxy & VPN Detection for details about strictness.) Stored on the record.
Two Fraud Score	Integer Data	The "strictness" = 2 fraud score for this IP address. (See Proxy & VPN Detection for details about strictness.) Stored on the record.
Latitude	Float Data	Latitude of IP address if available or "N/A" if unknown. Stored on the record.
Longitude	Float Data	Longitude of IP address if available or "N/A" if unknown. Stored on the record.
