Skip to content

Latest commit

 

History

History
216 lines (178 loc) · 7.55 KB

data_generator.md

File metadata and controls

216 lines (178 loc) · 7.55 KB

data_generator

A reader like processor that generates sample data. The default data generator creates randomized data fitting the data format in the examples. The processor can also create customized data records if provided a schema that works with the mocker-data-generator package, see the examples below for details.

Usage

Generate data

Example of jobs using the data_generator processor

{
    "name" : "testing",
    "workers" : 1,
    "slicers" : 1,
    "lifecycle" : "once",
    "assets" : [
        "standard"
    ],
    "operations" : [
        {
            "_op": "data_generator",
            "size": 10000
        },
        {
            "_op": "noop"
        }
    ]
}

Output of the example job

const slice = { count: 1000 }

const results = await fetcher.run(slice);

results.length ==== 1000;

results[0] === {
    "ip": "1.12.146.136",
    "userAgent": "Mozilla/5.0 (Windows NT 5.2; WOW64; rv:8.9) Gecko/20100101 Firefox/8.9.9",
    "url": "https://gabrielle.org",
    "uuid": "408433ff-9495-4d1c-b066-7f9668b168f0",
    "ipv6": "8188:b9ad:d02d:d69e:5ca4:05e2:9aa5:23b0",
    "location": "-25.40587, 56.56418",
    "created": "2016-01-19T13:33:09.356-07:00",
    "bytes": 4850020
}

Generate custom data within a date range

Example of a job using custom schema to generate records with dates between 2015-08-01T10:33:09.356Z and 2015-12-30T20:33:09.356Z

{
    "name" : "testing",
    "workers" : 1,
    "slicers" : 1,
    "lifecycle" : "once",
    "assets" : [
        "standard"
    ],
    "operations" : [
        {
            "_op": "data_generator",
            "json_schema": "some/path/to/file.js",
            "format": "isoBetween",
            "start": "2015-08-01T10:33:09.356Z",
            "end": "2015-12-30T20:33:09.356Z",
            "date_key": "joinDate"
        },
        {
            "_op": "noop"
        }
    ]
}

Example schema located at some/path/to/file.js

export default const schema = {
    firstName: {
        faker: 'name.firstName'
    },
    lastName: {
        faker: 'name.lastName'
    },
    country: {
        faker: 'address.country'
    }
}

Example output from the above job:

const slice = { count: 2 }

const results = await fetcher.run(slice);

results.length ==== 2;

results[0] === {
    "firstName": "Chilly",
    "lastName": "Willy",
    "country": "United States",
    "joinDate": "2015-10-10T10:13:09.157Z",
}

Stress Test and persistent mode

Example of a job using the data_generator to generate a persistent slice of 10,000 records for all 50 workers until the job is shutdown. This is useful for stress testing systems and down stream processes.

{
    "name" : "testing",
    "workers" : 50,
    "slicers" : 1,
    "lifecycle" : "persistent",
    "assets" : [
        "standard"
    ],
    "operations" : [
        {
            "_op": "data_generator",
            "stress_test": true,
            "size": 10000
        },
        {
            "_op": "noop"
        }
    ]
}

Example output from the above job:

const slice = { count: 1000 }

const results = await fetcher.run(slice);

results.length ==== 1000;

results[0] === {
    "ip": "1.12.146.136",
    "userAgent": "Mozilla/5.0 (Windows NT 5.2; WOW64; rv:8.9) Gecko/20100101 Firefox/8.9.9",
    "url": "https://gabrielle.org",
    "uuid": "408433ff-9495-4d1c-b066-7f9668b168f0",
    "ipv6": "8188:b9ad:d02d:d69e:5ca4:05e2:9aa5:23b0",
    "location": "-25.40587, 56.56418",
    "created": "2016-01-19T13:33:09.356-07:00",
    "bytes": 4850020
}

Generate controlled stream of data over a period of time

Example of a job using the data_generator processor to generate approximately 10,000 records per minute or 60,000 records per hour. Results could very as this is a loose approximation.

{
    "name" : "testing",
    "workers" : 1,
    "lifecycle" : "persistent",
    "assets" : [
        "standard"
    ],
    "operations" : [
        {
            "_op": "data_generator",
            "size": 5000,
            "delay": 30
        },
        {
            "_op": "noop"
        }
    ]
}

Parameters

Configuration Description Type Notes
_op Name of operation, it must reflect the exact name of the file String required
size If job lifecycle is set to once, then size is the total number of generated documents. If job lifecycle is set to persistent, then the generator will constantly stream data in chunks equal to the size Number required
json_schema File path to custom schema String optional, the schema must be exported Node style module.exports = schema
format Format of date in the timestamp field, options are dateNow, utcDate, utcBetween, isoBetween. See advanced configuration section for more details String optional, defaults to dateNow
start Start of date range String optional, only used with format isoBetween or utcBetween, defaults to Thu Jan 01 1970 00:00:00 GMT-0700 (MST)
end End of date range String optional, only used with format isoBetween or utcBetween, defaults to new Date()
stress_test If set to true, it will send non-unique documents following your schema as fast as possible. Helpful to determine downstream performance limits or constraints Boolean optional, defaults to false
delay Time in seconds that a worker will delay the completion of a slice. Good for generating controlled amounts of data within a loose time window. Number optional but can't be used when stress_test is set to true
date_key Name of they date field. If set, it will remove the created field on the default schema. String optional, defaults to created
set_id Sets an id field on each record whose value is formatted according the the option given. The options are base64url, hexadecimal, HEXADECIMAL String optional, it does not set any metadata fields, ie _key. See the set_key processor on how to set the _key in the metadata.
id_start_key Set if you would like to force the first part of the id to a certain character or set of characters Sting optional, must be used in tandem with set_id. id_start_key is essentially a regex. If you set it to "a", then the first character of the id will be "a", can also set ranges [a-f] or randomly alternate between b and a if its set to "[ab]"

Advanced Configuration

Description of date formats and time range options

Format Description
dateNow Formats dates in the ISO8601 specification, ie 2016-01-19T13:48:08.426-07:00, preserving local time. Values will be the current date and time.
isoBetween Uses the ISO8601 format, but date and time values are constrained by the start and end config settings.
utcDate Formats dates in the UTC specification, ie "2016-01-19T20:48:08.426Z". Values will be the current date and time.
utcBetween Uses the UTC format, but date and time values are constrained by the start and end config settings.

persistent mode

The data generator will continually stream data. In this mode the size value applies to the number of documents generated per slice instead of the total number of documents created as it does in once mode.