Twitter Streamer

This container provides a Python script that can stream data from the Twitter API to stdout, Kafka, or Kinesis.

Getting Started

To use this container you will need API keys to access the Twitter API. To get these, head over to the Twitter developer quick start page:
https://developer.twitter.com/en/docs/labs/filtered-stream/quick-start

Running the container

CLI options

To see the script's options:

$ docker run --rm 1904labs/twitter_stream:latest --help
usage: twitter_streamer.py [-h] [-f FILTER] [-k CONSUMER_KEY]
                           [-s CONSUMER_SECRET] [-a ACCESS_TOKEN]
                           [-t ACCESS_TOKEN_SECRET]

optional arguments:
  -h, --help            show this help message and exit
  -f FILTER, --filter FILTER
  -k CONSUMER_KEY, --consumer_key CONSUMER_KEY
  -s CONSUMER_SECRET, --consumer_secret CONSUMER_SECRET
  -a ACCESS_TOKEN, --access_token ACCESS_TOKEN
  -t ACCESS_TOKEN_SECRET, --access_token_secret ACCESS_TOKEN_SECRET

Stdout mode

Use the following command to run the container with output going to stdout (the Docker log):
docker run --rm 1904labs/twitter_stream:latest -f "#funny" -k <TWITTER_CONSUMER_KEY> -s <TWITTER_CONSUMER_SECRET> -a <TWITTER_ACCESS_TOKEN> -t <TWITTER_ACCESS_TOKEN_SECRET>

Send to Kinesis

To send the collected tweets to Kinesis, you will first need AWS credentials and an existing Kinesis stream. Within the container, the following environment variables need to be set (a sample invocation follows the list):

  • KINESIS_STREAM_NAME - required
  • KINESIS_REGION - optional, default=us-east-1
  • KINESIS_API_NAME - optional, default=firehose
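
For example, a run that streams to Kinesis might look like the following sketch. The stream name my-tweet-stream is a placeholder, and passing AWS credentials in via the standard AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables is an assumption about how credentials are supplied to the container, not something this README documents.

$ docker run --rm \
    -e KINESIS_STREAM_NAME=my-tweet-stream \
    -e KINESIS_REGION=us-east-1 \
    -e AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID> \
    -e AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY> \
    1904labs/twitter_stream:latest -f "#funny" -k <TWITTER_CONSUMER_KEY> -s <TWITTER_CONSUMER_SECRET> -a <TWITTER_ACCESS_TOKEN> -t <TWITTER_ACCESS_TOKEN_SECRET>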

Send to Kafka

To send the collected tweets to Kafka, you will need to set the following environment variable (a sample invocation follows):

  • KAFKA_BOOTSTRAP_SERVERS: 'host[:port]' string (or list of 'host[:port]' strings) that the producer should contact to bootstrap initial cluster metadata. This does not have to be the full node list. It just needs to have at least one broker that will respond to a Metadata API Request. Default port is 9092. If no servers are specified, will default to localhost:9092.
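
For example, assuming a broker reachable from the container at kafka-broker:9092 (a placeholder hostname), the variable can be passed with -e:

$ docker run --rm \
    -e KAFKA_BOOTSTRAP_SERVERS=kafka-broker:9092 \
    1904labs/twitter_stream:latest -f "#funny" -k <TWITTER_CONSUMER_KEY> -s <TWITTER_CONSUMER_SECRET> -a <TWITTER_ACCESS_TOKEN> -t <TWITTER_ACCESS_TOKEN_SECRET>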

Optional Kafka env variables

  • KAFKA_CLIENT_ID (str): a name for this client. This string is passed in each request to servers and can be used to identify specific server-side log entries that correspond to this client. Default: 'kafka-python-producer-#' (appended with a unique number per instance)
  • KAFKA_KEY_SERIALIZER (callable): used to convert user-supplied keys to bytes. If not None, called as f(key); should return bytes. Default: None.
  • KAFKA_VALUE_SERIALIZER (callable): used to convert user-supplied message values to bytes. If not None, called as f(value), should return bytes. Default: None.
  • KAFKA_ACKS (0, 1, 'all'): The number of acknowledgments the producer requires the leader to have received before considering a request complete. This controls the durability of records that are sent. The following settings are common:
    0: The producer will not wait for any acknowledgment from the server. The message will immediately be added to the socket buffer and considered sent. No guarantee can be made that the server has received the record in this case, and the retries configuration will not take effect (as the client won't generally know of any failures). The offset given back for each record will always be set to -1.
    1: Wait for the leader to write the record to its local log only. The broker will respond without awaiting full acknowledgement from all followers. In this case, should the leader fail immediately after acknowledging the record but before the followers have replicated it, the record will be lost.
    all: Wait for the full set of in-sync replicas to write the record. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee.
    If unset, defaults to acks=1.
  • KAFKA_COMPRESSION_TYPE (str): The compression type for all data generated by the producer. Valid values are 'gzip', 'snappy', 'lz4', or None. Compression is of full batches of data, so the efficacy of batching will also impact the compression ratio (more batching means better compression). Default: None.
  • KAFKA_RETRIES (int): Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error. Note that this retry is no different than if the client resent the record upon receiving the error. Allowing retries without setting max_in_flight_requests_per_connection to 1 will potentially change the ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second succeeds, then the records in the second batch may appear first. Default: 0.
  • KAFKA_BATCH_SIZE (int): Requests sent to brokers will contain multiple batches, one for each partition with data available to be sent. A small batch size will make batching less common and may reduce throughput (a batch size of zero will disable batching entirely). Default: 16384
  • KAFKA_LINGER_MS (int): The producer groups together any records that arrive in between request transmissions into a single batched request. Normally this occurs only under load when records arrive faster than they can be sent out. However in some circumstances the client may want to reduce the number of requests even under moderate load. This setting accomplishes this by adding a small amount of artificial delay; that is, rather than immediately sending out a record the producer will wait for up to the given delay to allow other records to be sent so that the sends can be batched together. This can be thought of as analogous to Nagle's algorithm in TCP. This setting gives the upper bound on the delay for batching: once we get batch_size worth of records for a partition it will be sent immediately regardless of this setting, however if we have fewer than this many bytes accumulated for this partition we will 'linger' for the specified time waiting for more records to show up. This setting defaults to 0 (i.e. no delay). Setting linger_ms=5 would have the effect of reducing the number of requests sent but would add up to 5ms of latency to records sent in the absence of load. Default: 0.
  • KAFKA_PARTITIONER (callable): Callable used to determine which partition each message is assigned to. Called (after key serialization): partitioner(key_bytes, all_partitions, available_partitions). The default partitioner implementation hashes each non-None key using the same murmur2 algorithm as the java client so that messages with the same key are assigned to the same partition. When a key is None, the message is delivered to a random partition (filtered to partitions with available leaders only, if possible).
  • KAFKA_BUFFER_MEMORY (int): The total bytes of memory the producer should use to buffer records waiting to be sent to the server. If records are sent faster than they can be delivered to the server the producer will block up to max_block_ms, raising an exception on timeout. In the current implementation, this setting is an approximation. Default: 33554432 (32MB)
  • KAFKA_CONNECTIONS_MAX_IDLE_MS: Close idle connections after the number of milliseconds specified by this config. The broker closes idle connections after connections.max.idle.ms, so this avoids hitting unexpected socket disconnected errors on the client. Default: 540000
  • KAFKA_MAX_BLOCK_MS (int): Number of milliseconds to block during :meth:~kafka.KafkaProducer.send and :meth:~kafka.KafkaProducer.partitions_for. These methods can be blocked either because the buffer is full or metadata unavailable. Blocking in the user-supplied serializers or partitioner will not be counted against this timeout. Default: 60000.
  • KAFKA_MAX_REQUEST_SIZE (int): The maximum size of a request. This is also effectively a cap on the maximum record size. Note that the server has its own cap on record size which may be different from this. This setting will limit the number of record batches the producer will send in a single request to avoid sending huge requests. Default: 1048576.
  • KAFKA_METADATA_MAX_AGE_MS (int): The period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes to proactively discover any new brokers or partitions. Default: 300000
  • KAFKA_RETRY_BACKOFF_MS (int): Milliseconds to backoff when retrying on errors. Default: 100.
  • KAFKA_REQUEST_TIMEOUT_MS (int): Client request timeout in milliseconds. Default: 30000.
  • KAFKA_RECEIVE_BUFFER_BYTES (int): The size of the TCP receive buffer (SO_RCVBUF) to use when reading data. Default: None (relies on system defaults). Java client defaults to 32768.
  • KAFKA_SEND_BUFFER_BYTES (int): The size of the TCP send buffer (SO_SNDBUF) to use when sending data. Default: None (relies on system defaults). Java client defaults to 131072.
  • KAFKA_SOCKET_OPTIONS (list): List of tuple-arguments to socket.setsockopt to apply to broker connection sockets. Default: [(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)]
  • KAFKA_RECONNECT_BACKOFF_MS (int): The amount of time in milliseconds to wait before attempting to reconnect to a given host. Default: 50.
  • KAFKA_RECONNECT_BACKOFF_MAX_MS (int): The maximum amount of time in milliseconds to backoff/wait when reconnecting to a broker that has repeatedly failed to connect. If provided, the backoff per host will increase exponentially for each consecutive connection failure, up to this maximum. Once the maximum is reached, reconnection attempts will continue periodically with this fixed rate. To avoid connection storms, a randomization factor of 0.2 will be applied to the backoff resulting in a random range between 20% below and 20% above the computed value. Default: 1000.
  • KAFKA_MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION (int): Requests are pipelined to kafka brokers up to this number of maximum requests per broker connection. Note that if this setting is set to be greater than 1 and there are failed sends, there is a risk of message re-ordering due to retries (i.e., if retries are enabled). Default: 5.
  • KAFKA_SECURITY_PROTOCOL (str): Protocol used to communicate with brokers. Valid values are: PLAINTEXT, SSL, SASL_PLAINTEXT, SASL_SSL. Default: PLAINTEXT.
  • KAFKA_SSL_CONTEXT (ssl.SSLContext): pre-configured SSLContext for wrapping socket connections. If provided, all other ssl_* configurations will be ignored. Default: None.
  • KAFKA_SSL_CHECK_HOSTNAME (bool): flag to configure whether the SSL handshake should verify that the certificate matches the broker's hostname. Default: True.
  • KAFKA_SSL_CAFILE (str): optional filename of a CA file to use in certificate verification. Default: None.
  • KAFKA_SSL_CERTFILE (str): optional filename of a file in PEM format containing the client certificate, as well as any CA certificates needed to establish the certificate's authenticity. Default: None.
  • KAFKA_SSL_KEYFILE (str): optional filename containing the client private key. Default: None.
  • KAFKA_SSL_PASSWORD (str): optional password to be used when loading the certificate chain. Default: None.
  • KAFKA_SSL_CRLFILE (str): optional filename containing the CRL to check for certificate expiration. By default, no CRL check is done. When providing a file, only the leaf certificate will be checked against this CRL. The CRL can only be checked with Python 3.4+ or 2.7.9+. Default: None.
  • KAFKA_SSL_CIPHERS (str): optionally set the available ciphers for SSL connections. It should be a string in the OpenSSL cipher list format. If no cipher can be selected (because compile-time options or other configuration forbids use of all the specified ciphers), an ssl.SSLError will be raised. See ssl.SSLContext.set_ciphers.
  • KAFKA_API_VERSION (tuple): Specify which Kafka API version to use. If set to None, the client will attempt to infer the broker version by probing various APIs. Example: (0, 10, 2). Default: None
  • KAFKA_API_VERSION_AUTO_TIMEOUT_MS (int): number of milliseconds after which to raise a timeout exception from the constructor when checking the broker API version. Only applies if api_version is set to None.
  • KAFKA_METRIC_REPORTERS (list): A list of classes to use as metrics reporters. Implementing the AbstractMetricsReporter interface allows plugging in classes that will be notified of new metric creation. Default: []
  • KAFKA_METRICS_NUM_SAMPLES (int): The number of samples maintained to compute metrics. Default: 2
  • KAFKA_METRICS_SAMPLE_WINDOW_MS (int): The maximum age in milliseconds of samples used to compute metrics. Default: 30000
  • KAFKA_SELECTOR (selectors.BaseSelector): Provide a specific selector implementation to use for I/O multiplexing. Default: selectors.DefaultSelector
  • KAFKA_SASL_MECHANISM (str): Authentication mechanism when security_protocol is configured for SASL_PLAINTEXT or SASL_SSL. Valid values are: PLAIN, GSSAPI, OAUTHBEARER, SCRAM-SHA-256, SCRAM-SHA-512.
  • KAFKA_SASL_PLAIN_USERNAME (str): username for sasl PLAIN and SCRAM authentication. Required if sasl_mechanism is PLAIN or one of the SCRAM mechanisms.
  • KAFKA_SASL_PLAIN_PASSWORD (str): password for sasl PLAIN and SCRAM authentication. Required if sasl_mechanism is PLAIN or one of the SCRAM mechanisms.
  • KAFKA_SASL_KERBEROS_SERVICE_NAME (str): Service name to include in GSSAPI sasl mechanism handshake. Default: 'kafka'
  • KAFKA_SASL_KERBEROS_DOMAIN_NAME (str): kerberos domain name to use in GSSAPI sasl mechanism handshake. Default: one of bootstrap servers
  • KAFKA_SASL_OAUTH_TOKEN_PROVIDER (AbstractTokenProvider): OAuthBearer token provider instance. (See kafka.oauth.abstract). Default: None
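
As a sketch of combining a few of these optional settings, the run below uses illustrative values (client id, acks, compression, linger) chosen for this example; how the script converts these string environment values into the types kafka-python expects is not covered by this README, so treat them as assumptions.

$ docker run --rm \
    -e KAFKA_BOOTSTRAP_SERVERS=kafka-broker:9092 \
    -e KAFKA_CLIENT_ID=twitter-streamer \
    -e KAFKA_ACKS=all \
    -e KAFKA_COMPRESSION_TYPE=gzip \
    -e KAFKA_LINGER_MS=5 \
    1904labs/twitter_stream:latest -f "#funny" -k <TWITTER_CONSUMER_KEY> -s <TWITTER_CONSUMER_SECRET> -a <TWITTER_ACCESS_TOKEN> -t <TWITTER_ACCESS_TOKEN_SECRET>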
