Deduplication Job #6

Merged
merged 2 commits into from
May 19, 2020

apeksharma (Collaborator) commented May 12, 2020

Detailed description:
The deduplication job runs as a Spring application.
It is triggered every 5 minutes (configurable) and does the following:

  • Gets the state of the last run
  • Checks whether there is new data to deduplicate
    • Checks whether the new data contains duplicates
      • Deduplicates it
    • Saves the new state
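
The steps above can be sketched in plain Java against an in-memory list. This is only an illustration of the control flow, not the code in this PR: the class name `DedupeSketch`, its fields, and `runOnce` are hypothetical, and no BigQuery or Spring calls are made. It assumes, as the test log suggests, that the saved state is the last deduplicated consensusTimestamp and that the next run starts just after it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory stand-in for one dedupe run; not the PR's implementation. */
class DedupeSketch {
    // Stand-in for the transactions table: one consensusTimestamp per row.
    final List<Long> transactions = new ArrayList<>();
    // Stand-in for the saved state (assumed to be the last deduplicated timestamp).
    long lastValidTimestamp = 0;

    /** Runs the job once and returns the number of duplicate rows removed. */
    int runOnce() {
        // 1. Get state of last run: the new window starts right after it.
        long start = lastValidTimestamp + 1;

        // 2. Check if there is new data: find the largest timestamp in the window.
        long end = Long.MIN_VALUE;
        for (long ts : transactions) {
            if (ts >= start) {
                end = Math.max(end, ts);
            }
        }
        if (end < start) {
            return 0; // nothing new, state is left unchanged
        }

        // 3. Check for duplicates in [start, end] and keep one row per timestamp.
        Set<Long> seen = new HashSet<>();
        List<Long> kept = new ArrayList<>();
        int removed = 0;
        for (long ts : transactions) {
            boolean inWindow = ts >= start && ts <= end;
            if (inWindow && !seen.add(ts)) {
                removed++; // second or later copy of a timestamp in the window
            } else {
                kept.add(ts);
            }
        }
        transactions.clear();
        transactions.addAll(kept);

        // 4. Save new state so the next run skips rows already processed.
        lastValidTimestamp = end;
        return removed;
    }

    public static void main(String[] args) {
        DedupeSketch job = new DedupeSketch();
        job.transactions.addAll(List.of(1L, 5L, 5L, 9L));
        System.out.println(job.runOnce()); // prints 1 (one copy of 5 removed)
        System.out.println(job.runOnce()); // prints 0 (no new data)
    }
}
```

Scheduling is left out of the sketch on purpose. In the job described here the run is triggered on a configurable interval, and each trigger would correspond to one `runOnce` call.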

Changes:

  • Adds metrics to track deduplication job
  • Adds new column 'dedupe' to transactions table
  • Adds integration test
  • Adds create-tables.sh script to create all tables needed by hedera-etl
  • Renames hedera-etl-dataflow to hedera-etl-bigquery

Signed-off-by: Apekshit Sharma [email protected]

Special notes for your reviewer:
Documentation changes will come in a follow-up.

Checklist

  • Documentation added
  • Tests updated

apeksharma (Collaborator, Author) commented May 12, 2020

Dedupe integration test run looks as follows:

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running com.hedera.dedupe.DedupeIntegrationTest
2020-05-12 12:33:26,937 INFO  [main ] o.s.b.t.c.SpringBootTestContextBootstrapper Neither @ContextConfiguration nor @ContextHierarchy found for test class [com.hedera.dedupe.DedupeIntegrationTest], using SpringBootContextLoader
2020-05-12 12:33:26,947 INFO  [main ] o.s.t.c.s.AbstractContextLoader Could not detect default resource locations for test class [com.hedera.dedupe.DedupeIntegrationTest]: no resource found for suffixes {-context.xml, Context.groovy}.
2020-05-12 12:33:26,949 INFO  [main ] o.s.t.c.s.AnnotationConfigContextLoaderUtils Could not detect default configuration classes for test class [com.hedera.dedupe.DedupeIntegrationTest]: DedupeIntegrationTest does not declare any static, non-private, non-final, nested classes annotated with @Configuration.
2020-05-12 12:33:27,117 INFO  [main ] o.s.b.t.c.SpringBootTestContextBootstrapper Found @SpringBootConfiguration com.hedera.dedupe.DedupeApplication for test class com.hedera.dedupe.DedupeIntegrationTest
2020-05-12 12:33:27,249 INFO  [main ] o.s.b.t.c.SpringBootTestContextBootstrapper Loaded default TestExecutionListener class names from location [META-INF/spring.factories]: [org.springframework.boot.test.mock.mockito.MockitoTestExecutionListener, org.springframework.boot.test.mock.mockito.ResetMocksTestExecutionListener, org.springframework.boot.test.autoconfigure.restdocs.RestDocsTestExecutionListener, org.springframework.boot.test.autoconfigure.web.client.MockRestServiceServerResetTestExecutionListener, org.springframework.boot.test.autoconfigure.web.servlet.MockMvcPrintOnlyOnFailureTestExecutionListener, org.springframework.boot.test.autoconfigure.web.servlet.WebDriverTestExecutionListener, org.springframework.test.context.web.ServletTestExecutionListener, org.springframework.test.context.support.DirtiesContextBeforeModesTestExecutionListener, org.springframework.test.context.support.DependencyInjectionTestExecutionListener, org.springframework.test.context.support.DirtiesContextTestExecutionListener, org.springframework.test.context.transaction.TransactionalTestExecutionListener, org.springframework.test.context.jdbc.SqlScriptsTestExecutionListener, org.springframework.test.context.event.EventPublishingTestExecutionListener]
2020-05-12 12:33:27,273 INFO  [main ] o.s.b.t.c.SpringBootTestContextBootstrapper Using TestExecutionListeners: [org.springframework.test.context.support.DirtiesContextBeforeModesTestExecutionListener@65d3a83, org.springframework.boot.test.mock.mockito.MockitoTestExecutionListener@7be9582e, org.springframework.boot.test.autoconfigure.SpringBootDependencyInjectionTestExecutionListener@73ac7716, org.springframework.test.context.support.DirtiesContextTestExecutionListener@125ace20, org.springframework.test.context.transaction.TransactionalTestExecutionListener@7dbc77ca, org.springframework.test.context.jdbc.SqlScriptsTestExecutionListener@4c25687b, org.springframework.test.context.event.EventPublishingTestExecutionListener@5c21b22e, org.springframework.boot.test.mock.mockito.ResetMocksTestExecutionListener@184e5c44, org.springframework.boot.test.autoconfigure.restdocs.RestDocsTestExecutionListener@6527aa0, org.springframework.boot.test.autoconfigure.web.client.MockRestServiceServerResetTestExecutionListener@6153aca1, org.springframework.boot.test.autoconfigure.web.servlet.MockMvcPrintOnlyOnFailureTestExecutionListener@30b2d267, org.springframework.boot.test.autoconfigure.web.servlet.WebDriverTestExecutionListener@5a740449]
2020-05-12 12:33:27,479 INFO  [background-preinit] o.h.v.i.u.Version HV000001: Hibernate Validator 6.0.19.Final

             @@@@@@@@@@@@@
          @@@@@@@@@@@@@@@@@@@
       @@@@@   0@@@@@@@@   @@@@@                 @@@        @@@                     @@@
     @@@@@@@   @@@@@@@@@   @@@@@@@               @@@        @@@                     @@@
    @@@@@@@@   @@@@@@@@@   @@@@@@@@              @@@        @@@    @@@@@@     @@@@@@@@@    @@@@@@    @@@@@@@@  @@@@@@
    @@@@@@@@               @@@@@@@@              @@@@@@@@@@@@@@  @@@    @@@  @@@    @@@  @@@    @@@  @@@@@          @@@
    @@@@@@@@   @@@@@@@@@   @@@@@@@@              @@@        @@@  @@@@@@@@@  @@@     @@@  8@@@@@@@@   @@@       @@@@@@@@
     @@@@@@@   @@@@@@@@@   @@@@@@@               @@@        @@@  @@@         @@@    @@@  @@@         @@@      @@@    @@
       @@@@@   @@@@@@@@@   @@@@@                 @@@        @@@    @@@@@@@     @@@@@@@@    @@@@@@@   @@@       @@@@@@@@
          @@@@@@@@@@@@@@@@@@@
             @@@@@@@@@@@@@                            BigQuery Deduplication

2020-05-12 12:33:27,814 INFO  [main ] c.h.d.DedupeIntegrationTest Starting DedupeIntegrationTest on apekshits-MBP with PID 67965 (started by appy in /Users/appy/git/hedera-etl/hedera-dedupe-bigquery)
2020-05-12 12:33:27,815 INFO  [main ] c.h.d.DedupeIntegrationTest No active profile set, falling back to default profiles: default
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/appy/.m2/repository/ch/qos/logback/logback-classic/1.2.3/logback-classic-1.2.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/appy/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.12.1/log4j-slf4j-impl-2.12.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
2020-05-12 12:33:28,711 INFO  [main ] o.s.i.c.DefaultConfiguringBeanFactoryPostProcessor No bean named 'errorChannel' has been explicitly defined. Therefore, a default PublishSubscribeChannel will be created.
2020-05-12 12:33:28,726 INFO  [main ] o.s.i.c.DefaultConfiguringBeanFactoryPostProcessor No bean named 'taskScheduler' has been explicitly defined. Therefore, a default ThreadPoolTaskScheduler will be created.
2020-05-12 12:33:28,737 INFO  [main ] o.s.i.c.DefaultConfiguringBeanFactoryPostProcessor No bean named 'integrationHeaderChannelRegistry' has been explicitly defined. Therefore, a default DefaultHeaderChannelRegistry will be created.
2020-05-12 12:33:28,923 INFO  [main ] o.s.c.s.PostProcessorRegistrationDelegate$BeanPostProcessorChecker Bean 'org.springframework.integration.config.IntegrationManagementConfiguration' of type [org.springframework.integration.config.IntegrationManagementConfiguration] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2020-05-12 12:33:28,959 INFO  [main ] o.s.c.s.PostProcessorRegistrationDelegate$BeanPostProcessorChecker Bean 'integrationChannelResolver' of type [org.springframework.integration.support.channel.BeanFactoryChannelResolver] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2020-05-12 12:33:28,964 INFO  [main ] o.s.c.s.PostProcessorRegistrationDelegate$BeanPostProcessorChecker Bean 'integrationDisposableAutoCreatedBeans' of type [org.springframework.integration.config.annotation.Disposables] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2020-05-12 12:33:29,266 INFO  [main ] o.s.c.g.a.c.GcpContextAutoConfiguration The default project ID is appy-dev-272020
2020-05-12 12:33:29,389 INFO  [main ] o.s.c.g.c.DefaultCredentialsProvider Default credentials provider for service account [email protected]
2020-05-12 12:33:29,389 INFO  [main ] o.s.c.g.c.DefaultCredentialsProvider Scopes in use by default credentials: [https://www.googleapis.com/auth/pubsub, https://www.googleapis.com/auth/spanner.admin, https://www.googleapis.com/auth/spanner.data, https://www.googleapis.com/auth/datastore, https://www.googleapis.com/auth/sqlservice.admin, https://www.googleapis.com/auth/devstorage.read_only, https://www.googleapis.com/auth/devstorage.read_write, https://www.googleapis.com/auth/cloudruntimeconfig, https://www.googleapis.com/auth/trace.append, https://www.googleapis.com/auth/cloud-platform, https://www.googleapis.com/auth/cloud-vision, https://www.googleapis.com/auth/bigquery]
2020-05-12 12:33:31,912 INFO  [main ] o.s.s.c.ThreadPoolTaskScheduler Initializing ExecutorService 'taskScheduler'
2020-05-12 12:33:32,270 INFO  [main ] o.s.i.e.EventDrivenConsumer Adding {logging-channel-adapter:_org.springframework.integration.errorLogger} as a subscriber to the 'errorChannel' channel
2020-05-12 12:33:32,271 INFO  [main ] o.s.i.c.PublishSubscribeChannel Channel 'application.errorChannel' has 1 subscriber(s).
2020-05-12 12:33:32,271 INFO  [main ] o.s.i.e.EventDrivenConsumer started bean '_org.springframework.integration.errorLogger'
2020-05-12 12:33:32,284 INFO  [main ] c.h.d.DedupeIntegrationTest Started DedupeIntegrationTest in 4.959 seconds (JVM running for 6.796)
2020-05-12 12:33:32,771 INFO  [main ] c.h.d.BigQueryHelper ### Starting job dedupe_reset_table_1589312012
2020-05-12 12:33:32,771 INFO  [main ] c.h.d.BigQueryHelper Query: DELETE FROM appy-dev-272020.dataset.transactions WHERE 1 = 1
2020-05-12 12:33:36,804 INFO  [main ] c.h.d.DedupeMetrics Job stats: runtime = 1716ms, affected rows = 200
2020-05-12 12:33:36,806 INFO  [main ] c.h.d.BigQueryHelper ### Starting job dedupe_save_state_1589312016
2020-05-12 12:33:36,806 INFO  [main ] c.h.d.BigQueryHelper Query: MERGE INTO appy-dev-272020.dataset.dedupe_state
USING (
  SELECT 'lastValidTimestamp' AS name, '0' AS value
)
ON FALSE
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW

2020-05-12 12:33:38,935 INFO  [main ] c.h.d.DedupeMetrics Job stats: runtime = 1278ms, affected rows = 2
2020-05-12 12:33:38,946 INFO  [main ] c.h.d.DedupeIntegrationTest Inserting duplicated data: INSERT INTO appy-dev-272020.dataset.transactions (consensusTimestampTruncated, consensusTimestamp) VALUES ( '1970-01-01T00:00:00Z', 1 ), ( '1970-01-01T00:00:00.916891Z', 916891138 ), ( '1970-01-01T00:00:00.992693Z', 992693948 ), ( '1970-01-01T00:00:01.930518Z', 1930518607 ), ( '1970-01-01T00:00:02.461117Z', 2461117840 ), ( '1970-01-01T00:00:02.527357Z', 2527357118 ), ( '1970-01-01T00:00:02.527357Z', 2527357118 ), ( '1970-01-01T00:00:03.143612Z', 3143612749 ), ( '1970-01-01T00:00:03.357172Z', 3357172551 ), ( '1970-01-01T00:00:03.538871Z', 3538871403 ), ( '1970-01-01T00:00:03.884991Z', 3884991286 ), ( '1970-01-01T00:00:04.297324Z', 4297324303 ), ( '1970-01-01T00:00:04.297324Z', 4297324303 ), ( '1970-01-01T00:00:04.646591Z', 4646591071 ), ( '1970-01-01T00:00:04.678611Z', 4678611817 ), ( '1970-01-01T00:00:05.569715Z', 5569715736 ), ( '1970-01-01T00:00:06.070054Z', 6070054868 ), ( '1970-01-01T00:00:06.705903Z', 6705903193 ), ( '1970-01-01T00:00:06.705903Z', 6705903193 ), ( '1970-01-01T00:00:06.950647Z', 6950647219 ), ( '1970-01-01T00:00:07.640563Z', 7640563867 ), ( '1970-01-01T00:00:08.134506Z', 8134506638 ), ( '1970-01-01T00:00:08.705422Z', 8705422677 ), ( '1970-01-01T00:00:09.212018Z', 9212018744 ), ( '1970-01-01T00:00:09.212018Z', 9212018744 ), ( '1970-01-01T00:00:09.354612Z', 9354612648 ), ( '1970-01-01T00:00:09.757878Z', 9757878437 ), ( '1970-01-01T00:00:10.302396Z', 10302396789 ), ( '1970-01-01T00:00:10.846335Z', 10846335705 ), ( '1970-01-01T00:00:11.221432Z', 11221432424 ), ( '1970-01-01T00:00:11.221432Z', 11221432424 ), ( '1970-01-01T00:00:11.535797Z', 11535797725 ), ( '1970-01-01T00:00:11.916570Z', 11916570790 ), ( '1970-01-01T00:00:12.394134Z', 12394134733 ), ( '1970-01-01T00:00:12.986585Z', 12986585497 ), ( '1970-01-01T00:00:13.417562Z', 13417562648 ), ( '1970-01-01T00:00:13.417562Z', 13417562648 ), ( '1970-01-01T00:00:14.330689Z', 14330689777 ), ( 
'1970-01-01T00:00:15.118904Z', 15118904761 ), ( '1970-01-01T00:00:15.210917Z', 15210917194 ), ( '1970-01-01T00:00:16.168365Z', 16168365736 ), ( '1970-01-01T00:00:17.099121Z', 17099121399 ), ( '1970-01-01T00:00:17.099121Z', 17099121399 ), ( '1970-01-01T00:00:17.446714Z', 17446714449 ), ( '1970-01-01T00:00:17.673240Z', 17673240692 ), ( '1970-01-01T00:00:18.667402Z', 18667402716 ), ( '1970-01-01T00:00:19.595521Z', 19595521767 ), ( '1970-01-01T00:00:20.420059Z', 20420059520 ), ( '1970-01-01T00:00:20.420059Z', 20420059520 ), ( '1970-01-01T00:00:20.478784Z', 20478784256 ), ( '1970-01-01T00:00:21.137511Z', 21137511613 ), ( '1970-01-01T00:00:21.524666Z', 21524666564 ), ( '1970-01-01T00:00:21.957371Z', 21957371630 ), ( '1970-01-01T00:00:22.507285Z', 22507285673 ), ( '1970-01-01T00:00:22.507285Z', 22507285673 ), ( '1970-01-01T00:00:22.721096Z', 22721096181 ), ( '1970-01-01T00:00:22.763662Z', 22763662829 ), ( '1970-01-01T00:00:23.494346Z', 23494346331 ), ( '1970-01-01T00:00:23.948877Z', 23948877081 ), ( '1970-01-01T00:00:24.234809Z', 24234809658 ), ( '1970-01-01T00:00:24.234809Z', 24234809658 ), ( '1970-01-01T00:00:24.278703Z', 24278703359 ), ( '1970-01-01T00:00:24.860587Z', 24860587781 ), ( '1970-01-01T00:00:25.727490Z', 25727490640 ), ( '1970-01-01T00:00:25.856886Z', 25856886000 ), ( '1970-01-01T00:00:26.354028Z', 26354028231 ), ( '1970-01-01T00:00:26.354028Z', 26354028231 ), ( '1970-01-01T00:00:26.917796Z', 26917796597 ), ( '1970-01-01T00:00:27.495668Z', 27495668289 ), ( '1970-01-01T00:00:27.810305Z', 27810305715 ), ( '1970-01-01T00:00:28.178388Z', 28178388580 ), ( '1970-01-01T00:00:28.938889Z', 28938889495 ), ( '1970-01-01T00:00:28.938889Z', 28938889495 ), ( '1970-01-01T00:00:29.296505Z', 29296505439 ), ( '1970-01-01T00:00:29.584856Z', 29584856681 ), ( '1970-01-01T00:00:29.646820Z', 29646820895 ), ( '1970-01-01T00:00:30.490153Z', 30490153394 ), ( '1970-01-01T00:00:31.243198Z', 31243198628 ), ( '1970-01-01T00:00:31.243198Z', 31243198628 ), ( '1970-01-01T00:00:31.289459Z', 
31289459063 ), ( '1970-01-01T00:00:32.209976Z', 32209976965 ), ( '1970-01-01T00:00:32.590409Z', 32590409396 ), ( '1970-01-01T00:00:33.123122Z', 33123122064 ), ( '1970-01-01T00:00:33.513719Z', 33513719688 ), ( '1970-01-01T00:00:33.513719Z', 33513719688 ), ( '1970-01-01T00:00:34.214406Z', 34214406956 ), ( '1970-01-01T00:00:35.183735Z', 35183735643 ), ( '1970-01-01T00:00:36.078799Z', 36078799524 ), ( '1970-01-01T00:00:37.038774Z', 37038774899 ), ( '1970-01-01T00:00:37.482587Z', 37482587834 ), ( '1970-01-01T00:00:37.482587Z', 37482587834 ), ( '1970-01-01T00:00:38.113606Z', 38113606828 ), ( '1970-01-01T00:00:38.615092Z', 38615092427 ), ( '1970-01-01T00:00:38.900888Z', 38900888955 ), ( '1970-01-01T00:00:39.521145Z', 39521145206 ), ( '1970-01-01T00:00:39.670451Z', 39670451474 ), ( '1970-01-01T00:00:39.670451Z', 39670451474 ), ( '1970-01-01T00:00:40.088350Z', 40088350773 ), ( '1970-01-01T00:00:40.729397Z', 40729397860 ), ( '1970-01-01T00:00:41.627148Z', 41627148483 ), ( '1970-01-01T00:00:41.993235Z', 41993235024 ), ( '1970-01-01T00:00:42.799223Z', 42799223918 ), ( '1970-01-01T00:00:42.799223Z', 42799223918 ), ( '1970-01-01T00:00:43.141105Z', 43141105012 ), ( '1970-01-01T00:00:43.924900Z', 43924900558 ), ( '1970-01-01T00:00:44.190167Z', 44190167516 ), ( '1970-01-01T00:00:45.189951Z', 45189951029 ), ( '1970-01-01T00:00:46.015664Z', 46015664761 ), ( '1970-01-01T00:00:46.015664Z', 46015664761 ), ( '1970-01-01T00:00:46.205142Z', 46205142826 ), ( '1970-01-01T00:00:46.819952Z', 46819952708 ), ( '1970-01-01T00:00:47.807108Z', 47807108926 ), ( '1970-01-01T00:00:47.990245Z', 47990245265 ), ( '1970-01-01T00:00:48.408475Z', 48408475880 ), ( '1970-01-01T00:00:48.408475Z', 48408475880 ), ( '1970-01-01T00:00:48.408919Z', 48408919273 ), ( '1970-01-01T00:00:49.398126Z', 49398126950 ), ( '1970-01-01T00:00:50.082658Z', 50082658180 ), ( '1970-01-01T00:00:50.120813Z', 50120813525 )
2020-05-12 12:33:38,947 INFO  [main ] c.h.d.BigQueryHelper ### Starting job dedupe_insert_data_1589312018
2020-05-12 12:33:38,947 INFO  [main ] c.h.d.BigQueryHelper Query: INSERT INTO appy-dev-272020.dataset.transactions (consensusTimestampTruncated, consensusTimestamp) VALUES ( '1970-01-01T00:00:00Z', 1 ), ( '1970-01-01T00:00:00.916891Z', 916891138 ), ( '1970-01-01T00:00:00.992693Z', 992693948 ), ( '1970-01-01T00:00:01.930518Z', 1930518607 ), ( '1970-01-01T00:00:02.461117Z', 2461117840 ), ( '1970-01-01T00:00:02.527357Z', 2527357118 ), ( '1970-01-01T00:00:02.527357Z', 2527357118 ), ( '1970-01-01T00:00:03.143612Z', 3143612749 ), ( '1970-01-01T00:00:03.357172Z', 3357172551 ), ( '1970-01-01T00:00:03.538871Z', 3538871403 ), ( '1970-01-01T00:00:03.884991Z', 3884991286 ), ( '1970-01-01T00:00:04.297324Z', 4297324303 ), ( '1970-01-01T00:00:04.297324Z', 4297324303 ), ( '1970-01-01T00:00:04.646591Z', 4646591071 ), ( '1970-01-01T00:00:04.678611Z', 4678611817 ), ( '1970-01-01T00:00:05.569715Z', 5569715736 ), ( '1970-01-01T00:00:06.070054Z', 6070054868 ), ( '1970-01-01T00:00:06.705903Z', 6705903193 ), ( '1970-01-01T00:00:06.705903Z', 6705903193 ), ( '1970-01-01T00:00:06.950647Z', 6950647219 ), ( '1970-01-01T00:00:07.640563Z', 7640563867 ), ( '1970-01-01T00:00:08.134506Z', 8134506638 ), ( '1970-01-01T00:00:08.705422Z', 8705422677 ), ( '1970-01-01T00:00:09.212018Z', 9212018744 ), ( '1970-01-01T00:00:09.212018Z', 9212018744 ), ( '1970-01-01T00:00:09.354612Z', 9354612648 ), ( '1970-01-01T00:00:09.757878Z', 9757878437 ), ( '1970-01-01T00:00:10.302396Z', 10302396789 ), ( '1970-01-01T00:00:10.846335Z', 10846335705 ), ( '1970-01-01T00:00:11.221432Z', 11221432424 ), ( '1970-01-01T00:00:11.221432Z', 11221432424 ), ( '1970-01-01T00:00:11.535797Z', 11535797725 ), ( '1970-01-01T00:00:11.916570Z', 11916570790 ), ( '1970-01-01T00:00:12.394134Z', 12394134733 ), ( '1970-01-01T00:00:12.986585Z', 12986585497 ), ( '1970-01-01T00:00:13.417562Z', 13417562648 ), ( '1970-01-01T00:00:13.417562Z', 13417562648 ), ( '1970-01-01T00:00:14.330689Z', 14330689777 ), ( '1970-01-01T00:00:15.118904Z', 15118904761 ), 
( '1970-01-01T00:00:15.210917Z', 15210917194 ), ( '1970-01-01T00:00:16.168365Z', 16168365736 ), ( '1970-01-01T00:00:17.099121Z', 17099121399 ), ( '1970-01-01T00:00:17.099121Z', 17099121399 ), ( '1970-01-01T00:00:17.446714Z', 17446714449 ), ( '1970-01-01T00:00:17.673240Z', 17673240692 ), ( '1970-01-01T00:00:18.667402Z', 18667402716 ), ( '1970-01-01T00:00:19.595521Z', 19595521767 ), ( '1970-01-01T00:00:20.420059Z', 20420059520 ), ( '1970-01-01T00:00:20.420059Z', 20420059520 ), ( '1970-01-01T00:00:20.478784Z', 20478784256 ), ( '1970-01-01T00:00:21.137511Z', 21137511613 ), ( '1970-01-01T00:00:21.524666Z', 21524666564 ), ( '1970-01-01T00:00:21.957371Z', 21957371630 ), ( '1970-01-01T00:00:22.507285Z', 22507285673 ), ( '1970-01-01T00:00:22.507285Z', 22507285673 ), ( '1970-01-01T00:00:22.721096Z', 22721096181 ), ( '1970-01-01T00:00:22.763662Z', 22763662829 ), ( '1970-01-01T00:00:23.494346Z', 23494346331 ), ( '1970-01-01T00:00:23.948877Z', 23948877081 ), ( '1970-01-01T00:00:24.234809Z', 24234809658 ), ( '1970-01-01T00:00:24.234809Z', 24234809658 ), ( '1970-01-01T00:00:24.278703Z', 24278703359 ), ( '1970-01-01T00:00:24.860587Z', 24860587781 ), ( '1970-01-01T00:00:25.727490Z', 25727490640 ), ( '1970-01-01T00:00:25.856886Z', 25856886000 ), ( '1970-01-01T00:00:26.354028Z', 26354028231 ), ( '1970-01-01T00:00:26.354028Z', 26354028231 ), ( '1970-01-01T00:00:26.917796Z', 26917796597 ), ( '1970-01-01T00:00:27.495668Z', 27495668289 ), ( '1970-01-01T00:00:27.810305Z', 27810305715 ), ( '1970-01-01T00:00:28.178388Z', 28178388580 ), ( '1970-01-01T00:00:28.938889Z', 28938889495 ), ( '1970-01-01T00:00:28.938889Z', 28938889495 ), ( '1970-01-01T00:00:29.296505Z', 29296505439 ), ( '1970-01-01T00:00:29.584856Z', 29584856681 ), ( '1970-01-01T00:00:29.646820Z', 29646820895 ), ( '1970-01-01T00:00:30.490153Z', 30490153394 ), ( '1970-01-01T00:00:31.243198Z', 31243198628 ), ( '1970-01-01T00:00:31.243198Z', 31243198628 ), ( '1970-01-01T00:00:31.289459Z', 31289459063 ), ( 
'1970-01-01T00:00:32.209976Z', 32209976965 ), ( '1970-01-01T00:00:32.590409Z', 32590409396 ), ( '1970-01-01T00:00:33.123122Z', 33123122064 ), ( '1970-01-01T00:00:33.513719Z', 33513719688 ), ( '1970-01-01T00:00:33.513719Z', 33513719688 ), ( '1970-01-01T00:00:34.214406Z', 34214406956 ), ( '1970-01-01T00:00:35.183735Z', 35183735643 ), ( '1970-01-01T00:00:36.078799Z', 36078799524 ), ( '1970-01-01T00:00:37.038774Z', 37038774899 ), ( '1970-01-01T00:00:37.482587Z', 37482587834 ), ( '1970-01-01T00:00:37.482587Z', 37482587834 ), ( '1970-01-01T00:00:38.113606Z', 38113606828 ), ( '1970-01-01T00:00:38.615092Z', 38615092427 ), ( '1970-01-01T00:00:38.900888Z', 38900888955 ), ( '1970-01-01T00:00:39.521145Z', 39521145206 ), ( '1970-01-01T00:00:39.670451Z', 39670451474 ), ( '1970-01-01T00:00:39.670451Z', 39670451474 ), ( '1970-01-01T00:00:40.088350Z', 40088350773 ), ( '1970-01-01T00:00:40.729397Z', 40729397860 ), ( '1970-01-01T00:00:41.627148Z', 41627148483 ), ( '1970-01-01T00:00:41.993235Z', 41993235024 ), ( '1970-01-01T00:00:42.799223Z', 42799223918 ), ( '1970-01-01T00:00:42.799223Z', 42799223918 ), ( '1970-01-01T00:00:43.141105Z', 43141105012 ), ( '1970-01-01T00:00:43.924900Z', 43924900558 ), ( '1970-01-01T00:00:44.190167Z', 44190167516 ), ( '1970-01-01T00:00:45.189951Z', 45189951029 ), ( '1970-01-01T00:00:46.015664Z', 46015664761 ), ( '1970-01-01T00:00:46.015664Z', 46015664761 ), ( '1970-01-01T00:00:46.205142Z', 46205142826 ), ( '1970-01-01T00:00:46.819952Z', 46819952708 ), ( '1970-01-01T00:00:47.807108Z', 47807108926 ), ( '1970-01-01T00:00:47.990245Z', 47990245265 ), ( '1970-01-01T00:00:48.408475Z', 48408475880 ), ( '1970-01-01T00:00:48.408475Z', 48408475880 ), ( '1970-01-01T00:00:48.408919Z', 48408919273 ), ( '1970-01-01T00:00:49.398126Z', 49398126950 ), ( '1970-01-01T00:00:50.082658Z', 50082658180 ), ( '1970-01-01T00:00:50.120813Z', 50120813525 )
2020-05-12 12:33:42,112 INFO  [main ] c.h.d.DedupeMetrics Job stats: runtime = 1914ms, affected rows = 119
2020-05-12 12:33:42,387 INFO  [main ] c.h.d.BigQueryHelper ### Starting job dedupe_flag_dml_accessible_rows_1589312022
2020-05-12 12:33:42,387 INFO  [main ] c.h.d.BigQueryHelper Query: UPDATE appy-dev-272020.dataset.transactions SET dedupe = 1 WHERE (consensusTimestamp >= 1 AND consensusTimestampTruncated >= '1970-01-01T00:00:00Z')
2020-05-12 12:33:45,710 INFO  [main ] c.h.d.DedupeMetrics Job stats: runtime = 1958ms, affected rows = 119
2020-05-12 12:33:45,713 INFO  [main ] c.h.d.BigQueryHelper ### Starting job dedupe_get_end_timestamp_1589312025
2020-05-12 12:33:45,714 INFO  [main ] c.h.d.BigQueryHelper Query: SELECT MAX(consensusTimestamp) AS ts
FROM appy-dev-272020.dataset.transactions
WHERE (consensusTimestamp >= 1 AND consensusTimestampTruncated >= '1970-01-01T00:00:00Z') AND dedupe IS NOT NULL
2020-05-12 12:33:47,602 INFO  [main ] c.h.d.DedupeMetrics Job stats: runtime = 505ms, affected rows = null
2020-05-12 12:33:47,602 INFO  [main ] c.h.d.DedupeRunner endTimestamp = 50120813525
2020-05-12 12:33:47,603 INFO  [main ] c.h.d.BigQueryHelper ### Starting job dedupe_get_duplicates_1589312027
2020-05-12 12:33:47,603 INFO  [main ] c.h.d.BigQueryHelper Query: SELECT count(*) AS num, consensusTimestamp
  FROM appy-dev-272020.dataset.transactions
  WHERE (consensusTimestamp BETWEEN 1 AND 50120813525) AND (consensusTimestampTruncated BETWEEN '1970-01-01T00:00:00Z' AND '1970-01-01T00:00:50.120813Z')
  GROUP BY consensusTimestamp HAVING num > 1
2020-05-12 12:33:49,342 INFO  [main ] c.h.d.DedupeMetrics Job stats: runtime = 374ms, affected rows = null
2020-05-12 12:33:49,342 INFO  [main ] c.h.d.DedupeRunner Duplicates found
2020-05-12 12:33:49,342 INFO  [main ] c.h.d.DedupeRunner consensusTimestamp, count
2020-05-12 12:33:49,343 INFO  [main ] c.h.d.DedupeRunner 22507285673, 2
2020-05-12 12:33:49,343 INFO  [main ] c.h.d.DedupeRunner 37482587834, 2
2020-05-12 12:33:49,343 INFO  [main ] c.h.d.DedupeRunner 17099121399, 2
2020-05-12 12:33:49,343 INFO  [main ] c.h.d.DedupeRunner 42799223918, 2
2020-05-12 12:33:49,343 INFO  [main ] c.h.d.DedupeRunner 46015664761, 2
2020-05-12 12:33:49,343 INFO  [main ] c.h.d.DedupeRunner 39670451474, 2
2020-05-12 12:33:49,343 INFO  [main ] c.h.d.DedupeRunner 11221432424, 2
2020-05-12 12:33:49,343 INFO  [main ] c.h.d.DedupeRunner 31243198628, 2
2020-05-12 12:33:49,344 INFO  [main ] c.h.d.DedupeRunner 20420059520, 2
2020-05-12 12:33:49,344 INFO  [main ] c.h.d.DedupeRunner 6705903193, 2
2020-05-12 12:33:49,344 INFO  [main ] c.h.d.DedupeRunner 33513719688, 2
2020-05-12 12:33:49,344 INFO  [main ] c.h.d.DedupeRunner 24234809658, 2
2020-05-12 12:33:49,344 INFO  [main ] c.h.d.DedupeRunner 13417562648, 2
2020-05-12 12:33:49,344 INFO  [main ] c.h.d.DedupeRunner 26354028231, 2
2020-05-12 12:33:49,344 INFO  [main ] c.h.d.DedupeRunner 4297324303, 2
2020-05-12 12:33:49,345 INFO  [main ] c.h.d.DedupeRunner 9212018744, 2
2020-05-12 12:33:49,345 INFO  [main ] c.h.d.DedupeRunner 2527357118, 2
2020-05-12 12:33:49,345 INFO  [main ] c.h.d.DedupeRunner 28938889495, 2
2020-05-12 12:33:49,345 INFO  [main ] c.h.d.DedupeRunner 48408475880, 2
2020-05-12 12:33:49,346 INFO  [main ] c.h.d.BigQueryHelper ### Starting job dedupe_remove_duplicates_1589312029
2020-05-12 12:33:49,346 INFO  [main ] c.h.d.BigQueryHelper Query: MERGE INTO appy-dev-272020.dataset.transactions AS dest
USING (
  SELECT k.*
  FROM (
    SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
    FROM appy-dev-272020.dataset.transactions AS original_data
    WHERE (consensusTimestamp BETWEEN 1 AND 50120813525) AND (consensusTimestampTruncated BETWEEN '1970-01-01T00:00:00Z' AND '1970-01-01T00:00:50.120813Z')
    GROUP BY consensusTimestamp
  )
) AS src
ON FALSE
WHEN NOT MATCHED BY SOURCE AND (dest.consensusTimestamp BETWEEN 1 AND 50120813525) AND (dest.consensusTimestampTruncated BETWEEN '1970-01-01T00:00:00Z' AND '1970-01-01T00:00:50.120813Z') -- remove all data in partition range
    THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
2020-05-12 12:33:53,600 INFO  [main ] c.h.d.DedupeMetrics Job stats: runtime = 2636ms, affected rows = 219
2020-05-12 12:33:53,603 INFO  [main ] c.h.d.BigQueryHelper ### Starting job dedupe_save_state_1589312033
2020-05-12 12:33:53,603 INFO  [main ] c.h.d.BigQueryHelper Query: MERGE INTO appy-dev-272020.dataset.dedupe_state
USING (
  SELECT 'lastValidTimestamp' AS name, '50120813525' AS value
)
ON FALSE
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW

2020-05-12 12:33:56,208 INFO  [main ] c.h.d.DedupeMetrics Job stats: runtime = 1411ms, affected rows = 2
2020-05-12 12:33:58,867 INFO  [main ] c.h.d.DedupeIntegrationTest Inserting duplicated data: INSERT INTO appy-dev-272020.dataset.transactions (consensusTimestampTruncated, consensusTimestamp) VALUES ( '1970-01-01T00:00:50.120813Z', 50120813526 ), ( '1970-01-01T00:00:50.210322Z', 50210322499 ), ( '1970-01-01T00:00:50.784141Z', 50784141652 ), ( '1970-01-01T00:00:51.428927Z', 51428927890 ), ( '1970-01-01T00:00:51.985411Z', 51985411246 ), ( '1970-01-01T00:00:52.192828Z', 52192828272 ), ( '1970-01-01T00:00:52.192828Z', 52192828272 ), ( '1970-01-01T00:00:52.555804Z', 52555804049 ), ( '1970-01-01T00:00:52.681952Z', 52681952270 ), ( '1970-01-01T00:00:53.252711Z', 53252711875 ), ( '1970-01-01T00:00:54.199591Z', 54199591757 ), ( '1970-01-01T00:00:54.670259Z', 54670259378 ), ( '1970-01-01T00:00:54.670259Z', 54670259378 ), ( '1970-01-01T00:00:55.342022Z', 55342022448 ), ( '1970-01-01T00:00:55.384621Z', 55384621893 ), ( '1970-01-01T00:00:56.108444Z', 56108444621 ), ( '1970-01-01T00:00:56.768122Z', 56768122814 ), ( '1970-01-01T00:00:56.875693Z', 56875693157 ), ( '1970-01-01T00:00:56.875693Z', 56875693157 ), ( '1970-01-01T00:00:57.409612Z', 57409612819 ), ( '1970-01-01T00:00:57.612849Z', 57612849955 ), ( '1970-01-01T00:00:58.367282Z', 58367282369 ), ( '1970-01-01T00:00:58.991936Z', 58991936779 ), ( '1970-01-01T00:00:59.775008Z', 59775008578 ), ( '1970-01-01T00:00:59.775008Z', 59775008578 ), ( '1970-01-01T00:01:00.124414Z', 60124414226 ), ( '1970-01-01T00:01:00.294636Z', 60294636881 ), ( '1970-01-01T00:01:00.995837Z', 60995837679 ), ( '1970-01-01T00:01:01.389024Z', 61389024571 ), ( '1970-01-01T00:01:01.747398Z', 61747398494 ), ( '1970-01-01T00:01:01.747398Z', 61747398494 ), ( '1970-01-01T00:01:02.738887Z', 62738887355 ), ( '1970-01-01T00:01:03.407357Z', 63407357738 ), ( '1970-01-01T00:01:03.829983Z', 63829983217 ), ( '1970-01-01T00:01:04.568141Z', 64568141980 ), ( '1970-01-01T00:01:05.215306Z', 65215306228 ), ( '1970-01-01T00:01:05.215306Z', 65215306228 ), ( 
'1970-01-01T00:01:05.820741Z', 65820741775 ), ( '1970-01-01T00:01:06.792279Z', 66792279552 ), ( '1970-01-01T00:01:07.728305Z', 67728305397 ), ( '1970-01-01T00:01:08.051426Z', 68051426108 ), ( '1970-01-01T00:01:08.457925Z', 68457925057 ), ( '1970-01-01T00:01:08.457925Z', 68457925057 ), ( '1970-01-01T00:01:08.543066Z', 68543066618 ), ( '1970-01-01T00:01:08.623373Z', 68623373611 ), ( '1970-01-01T00:01:08.867813Z', 68867813355 ), ( '1970-01-01T00:01:08.877942Z', 68877942015 ), ( '1970-01-01T00:01:09.127696Z', 69127696656 ), ( '1970-01-01T00:01:09.127696Z', 69127696656 ), ( '1970-01-01T00:01:09.514022Z', 69514022415 ), ( '1970-01-01T00:01:10.345981Z', 70345981269 ), ( '1970-01-01T00:01:10.424556Z', 70424556842 ), ( '1970-01-01T00:01:11.113057Z', 71113057936 ), ( '1970-01-01T00:01:11.263889Z', 71263889391 ), ( '1970-01-01T00:01:11.263889Z', 71263889391 ), ( '1970-01-01T00:01:12.108410Z', 72108410943 ), ( '1970-01-01T00:01:12.839886Z', 72839886203 ), ( '1970-01-01T00:01:13.005141Z', 73005141941 ), ( '1970-01-01T00:01:13.423284Z', 73423284096 ), ( '1970-01-01T00:01:13.946728Z', 73946728562 ), ( '1970-01-01T00:01:13.946728Z', 73946728562 ), ( '1970-01-01T00:01:14.265325Z', 74265325946 ), ( '1970-01-01T00:01:15.189135Z', 75189135193 ), ( '1970-01-01T00:01:16.145289Z', 76145289658 ), ( '1970-01-01T00:01:16.457426Z', 76457426405 ), ( '1970-01-01T00:01:16.710762Z', 76710762665 ), ( '1970-01-01T00:01:16.710762Z', 76710762665 ), ( '1970-01-01T00:01:17.109782Z', 77109782778 ), ( '1970-01-01T00:01:18.058080Z', 78058080305 ), ( '1970-01-01T00:01:18.152163Z', 78152163994 ), ( '1970-01-01T00:01:18.444032Z', 78444032459 ), ( '1970-01-01T00:01:18.493722Z', 78493722568 ), ( '1970-01-01T00:01:18.493722Z', 78493722568 ), ( '1970-01-01T00:01:18.536361Z', 78536361705 ), ( '1970-01-01T00:01:19.385185Z', 79385185091 ), ( '1970-01-01T00:01:20.016737Z', 80016737526 ), ( '1970-01-01T00:01:20.021915Z', 80021915501 ), ( '1970-01-01T00:01:20.925037Z', 80925037971 ), ( '1970-01-01T00:01:20.925037Z', 
80925037971 ), ( '1970-01-01T00:01:21.035152Z', 81035152821 ), ( '1970-01-01T00:01:21.742668Z', 81742668987 ), ( '1970-01-01T00:01:22.284677Z', 82284677880 ), ( '1970-01-01T00:01:22.637254Z', 82637254337 ), ( '1970-01-01T00:01:23.334900Z', 83334900543 ), ( '1970-01-01T00:01:23.334900Z', 83334900543 ), ( '1970-01-01T00:01:23.499778Z', 83499778125 ), ( '1970-01-01T00:01:23.931931Z', 83931931868 ), ( '1970-01-01T00:01:24.579169Z', 84579169487 ), ( '1970-01-01T00:01:24.874490Z', 84874490380 ), ( '1970-01-01T00:01:25.307360Z', 85307360121 ), ( '1970-01-01T00:01:25.307360Z', 85307360121 ), ( '1970-01-01T00:01:25.699484Z', 85699484143 ), ( '1970-01-01T00:01:26.302778Z', 86302778292 ), ( '1970-01-01T00:01:26.837442Z', 86837442464 ), ( '1970-01-01T00:01:27.733934Z', 87733934572 ), ( '1970-01-01T00:01:27.903974Z', 87903974465 ), ( '1970-01-01T00:01:27.903974Z', 87903974465 ), ( '1970-01-01T00:01:28.432591Z', 88432591838 ), ( '1970-01-01T00:01:29.020534Z', 89020534217 ), ( '1970-01-01T00:01:29.606894Z', 89606894880 ), ( '1970-01-01T00:01:30.054026Z', 90054026369 ), ( '1970-01-01T00:01:30.453324Z', 90453324013 ), ( '1970-01-01T00:01:30.453324Z', 90453324013 ), ( '1970-01-01T00:01:30.853340Z', 90853340815 ), ( '1970-01-01T00:01:31.687082Z', 91687082843 ), ( '1970-01-01T00:01:32.299704Z', 92299704433 ), ( '1970-01-01T00:01:33.114594Z', 93114594919 ), ( '1970-01-01T00:01:33.374108Z', 93374108854 ), ( '1970-01-01T00:01:33.374108Z', 93374108854 ), ( '1970-01-01T00:01:33.818465Z', 93818465853 ), ( '1970-01-01T00:01:34.337178Z', 94337178698 ), ( '1970-01-01T00:01:35.036946Z', 95036946210 ), ( '1970-01-01T00:01:35.767630Z', 95767630154 ), ( '1970-01-01T00:01:36.257397Z', 96257397717 ), ( '1970-01-01T00:01:36.257397Z', 96257397717 ), ( '1970-01-01T00:01:36.483915Z', 96483915980 ), ( '1970-01-01T00:01:37.418291Z', 97418291009 ), ( '1970-01-01T00:01:38.189366Z', 98189366824 ), ( '1970-01-01T00:01:38.495816Z', 98495816381 )
2020-05-12 12:33:58,870 INFO  [main ] c.h.d.BigQueryHelper ### Starting job dedupe_insert_data_1589312038
2020-05-12 12:33:58,870 INFO  [main ] c.h.d.BigQueryHelper Query: INSERT INTO appy-dev-272020.dataset.transactions (consensusTimestampTruncated, consensusTimestamp) VALUES ( '1970-01-01T00:00:50.120813Z', 50120813526 ), ( '1970-01-01T00:00:50.210322Z', 50210322499 ), ( '1970-01-01T00:00:50.784141Z', 50784141652 ), ( '1970-01-01T00:00:51.428927Z', 51428927890 ), ( '1970-01-01T00:00:51.985411Z', 51985411246 ), ( '1970-01-01T00:00:52.192828Z', 52192828272 ), ( '1970-01-01T00:00:52.192828Z', 52192828272 ), ( '1970-01-01T00:00:52.555804Z', 52555804049 ), ( '1970-01-01T00:00:52.681952Z', 52681952270 ), ( '1970-01-01T00:00:53.252711Z', 53252711875 ), ( '1970-01-01T00:00:54.199591Z', 54199591757 ), ( '1970-01-01T00:00:54.670259Z', 54670259378 ), ( '1970-01-01T00:00:54.670259Z', 54670259378 ), ( '1970-01-01T00:00:55.342022Z', 55342022448 ), ( '1970-01-01T00:00:55.384621Z', 55384621893 ), ( '1970-01-01T00:00:56.108444Z', 56108444621 ), ( '1970-01-01T00:00:56.768122Z', 56768122814 ), ( '1970-01-01T00:00:56.875693Z', 56875693157 ), ( '1970-01-01T00:00:56.875693Z', 56875693157 ), ( '1970-01-01T00:00:57.409612Z', 57409612819 ), ( '1970-01-01T00:00:57.612849Z', 57612849955 ), ( '1970-01-01T00:00:58.367282Z', 58367282369 ), ( '1970-01-01T00:00:58.991936Z', 58991936779 ), ( '1970-01-01T00:00:59.775008Z', 59775008578 ), ( '1970-01-01T00:00:59.775008Z', 59775008578 ), ( '1970-01-01T00:01:00.124414Z', 60124414226 ), ( '1970-01-01T00:01:00.294636Z', 60294636881 ), ( '1970-01-01T00:01:00.995837Z', 60995837679 ), ( '1970-01-01T00:01:01.389024Z', 61389024571 ), ( '1970-01-01T00:01:01.747398Z', 61747398494 ), ( '1970-01-01T00:01:01.747398Z', 61747398494 ), ( '1970-01-01T00:01:02.738887Z', 62738887355 ), ( '1970-01-01T00:01:03.407357Z', 63407357738 ), ( '1970-01-01T00:01:03.829983Z', 63829983217 ), ( '1970-01-01T00:01:04.568141Z', 64568141980 ), ( '1970-01-01T00:01:05.215306Z', 65215306228 ), ( '1970-01-01T00:01:05.215306Z', 65215306228 ), ( '1970-01-01T00:01:05.820741Z', 65820741775 ), ( 
'1970-01-01T00:01:06.792279Z', 66792279552 ), ( '1970-01-01T00:01:07.728305Z', 67728305397 ), ( '1970-01-01T00:01:08.051426Z', 68051426108 ), ( '1970-01-01T00:01:08.457925Z', 68457925057 ), ( '1970-01-01T00:01:08.457925Z', 68457925057 ), ( '1970-01-01T00:01:08.543066Z', 68543066618 ), ( '1970-01-01T00:01:08.623373Z', 68623373611 ), ( '1970-01-01T00:01:08.867813Z', 68867813355 ), ( '1970-01-01T00:01:08.877942Z', 68877942015 ), ( '1970-01-01T00:01:09.127696Z', 69127696656 ), ( '1970-01-01T00:01:09.127696Z', 69127696656 ), ( '1970-01-01T00:01:09.514022Z', 69514022415 ), ( '1970-01-01T00:01:10.345981Z', 70345981269 ), ( '1970-01-01T00:01:10.424556Z', 70424556842 ), ( '1970-01-01T00:01:11.113057Z', 71113057936 ), ( '1970-01-01T00:01:11.263889Z', 71263889391 ), ( '1970-01-01T00:01:11.263889Z', 71263889391 ), ( '1970-01-01T00:01:12.108410Z', 72108410943 ), ( '1970-01-01T00:01:12.839886Z', 72839886203 ), ( '1970-01-01T00:01:13.005141Z', 73005141941 ), ( '1970-01-01T00:01:13.423284Z', 73423284096 ), ( '1970-01-01T00:01:13.946728Z', 73946728562 ), ( '1970-01-01T00:01:13.946728Z', 73946728562 ), ( '1970-01-01T00:01:14.265325Z', 74265325946 ), ( '1970-01-01T00:01:15.189135Z', 75189135193 ), ( '1970-01-01T00:01:16.145289Z', 76145289658 ), ( '1970-01-01T00:01:16.457426Z', 76457426405 ), ( '1970-01-01T00:01:16.710762Z', 76710762665 ), ( '1970-01-01T00:01:16.710762Z', 76710762665 ), ( '1970-01-01T00:01:17.109782Z', 77109782778 ), ( '1970-01-01T00:01:18.058080Z', 78058080305 ), ( '1970-01-01T00:01:18.152163Z', 78152163994 ), ( '1970-01-01T00:01:18.444032Z', 78444032459 ), ( '1970-01-01T00:01:18.493722Z', 78493722568 ), ( '1970-01-01T00:01:18.493722Z', 78493722568 ), ( '1970-01-01T00:01:18.536361Z', 78536361705 ), ( '1970-01-01T00:01:19.385185Z', 79385185091 ), ( '1970-01-01T00:01:20.016737Z', 80016737526 ), ( '1970-01-01T00:01:20.021915Z', 80021915501 ), ( '1970-01-01T00:01:20.925037Z', 80925037971 ), ( '1970-01-01T00:01:20.925037Z', 80925037971 ), ( '1970-01-01T00:01:21.035152Z', 
81035152821 ), ( '1970-01-01T00:01:21.742668Z', 81742668987 ), ( '1970-01-01T00:01:22.284677Z', 82284677880 ), ( '1970-01-01T00:01:22.637254Z', 82637254337 ), ( '1970-01-01T00:01:23.334900Z', 83334900543 ), ( '1970-01-01T00:01:23.334900Z', 83334900543 ), ( '1970-01-01T00:01:23.499778Z', 83499778125 ), ( '1970-01-01T00:01:23.931931Z', 83931931868 ), ( '1970-01-01T00:01:24.579169Z', 84579169487 ), ( '1970-01-01T00:01:24.874490Z', 84874490380 ), ( '1970-01-01T00:01:25.307360Z', 85307360121 ), ( '1970-01-01T00:01:25.307360Z', 85307360121 ), ( '1970-01-01T00:01:25.699484Z', 85699484143 ), ( '1970-01-01T00:01:26.302778Z', 86302778292 ), ( '1970-01-01T00:01:26.837442Z', 86837442464 ), ( '1970-01-01T00:01:27.733934Z', 87733934572 ), ( '1970-01-01T00:01:27.903974Z', 87903974465 ), ( '1970-01-01T00:01:27.903974Z', 87903974465 ), ( '1970-01-01T00:01:28.432591Z', 88432591838 ), ( '1970-01-01T00:01:29.020534Z', 89020534217 ), ( '1970-01-01T00:01:29.606894Z', 89606894880 ), ( '1970-01-01T00:01:30.054026Z', 90054026369 ), ( '1970-01-01T00:01:30.453324Z', 90453324013 ), ( '1970-01-01T00:01:30.453324Z', 90453324013 ), ( '1970-01-01T00:01:30.853340Z', 90853340815 ), ( '1970-01-01T00:01:31.687082Z', 91687082843 ), ( '1970-01-01T00:01:32.299704Z', 92299704433 ), ( '1970-01-01T00:01:33.114594Z', 93114594919 ), ( '1970-01-01T00:01:33.374108Z', 93374108854 ), ( '1970-01-01T00:01:33.374108Z', 93374108854 ), ( '1970-01-01T00:01:33.818465Z', 93818465853 ), ( '1970-01-01T00:01:34.337178Z', 94337178698 ), ( '1970-01-01T00:01:35.036946Z', 95036946210 ), ( '1970-01-01T00:01:35.767630Z', 95767630154 ), ( '1970-01-01T00:01:36.257397Z', 96257397717 ), ( '1970-01-01T00:01:36.257397Z', 96257397717 ), ( '1970-01-01T00:01:36.483915Z', 96483915980 ), ( '1970-01-01T00:01:37.418291Z', 97418291009 ), ( '1970-01-01T00:01:38.189366Z', 98189366824 ), ( '1970-01-01T00:01:38.495816Z', 98495816381 )
2020-05-12 12:34:01,775 INFO  [main ] c.h.d.DedupeMetrics Job stats: runtime = 1623ms, affected rows = 119
2020-05-12 12:34:02,015 INFO  [main ] c.h.d.BigQueryHelper ### Starting job dedupe_flag_dml_accessible_rows_1589312042
2020-05-12 12:34:02,015 INFO  [main ] c.h.d.BigQueryHelper Query: UPDATE appy-dev-272020.dataset.transactions SET dedupe = 1 WHERE (consensusTimestamp >= 50120813526 AND consensusTimestampTruncated >= '1970-01-01T00:00:50.120813Z')
2020-05-12 12:34:05,699 INFO  [main ] c.h.d.DedupeMetrics Job stats: runtime = 2326ms, affected rows = 119
2020-05-12 12:34:05,700 INFO  [main ] c.h.d.BigQueryHelper ### Starting job dedupe_get_end_timestamp_1589312045
2020-05-12 12:34:05,700 INFO  [main ] c.h.d.BigQueryHelper Query: SELECT MAX(consensusTimestamp) AS ts
FROM appy-dev-272020.dataset.transactions
WHERE (consensusTimestamp >= 50120813526 AND consensusTimestampTruncated >= '1970-01-01T00:00:50.120813Z') AND dedupe IS NOT NULL
2020-05-12 12:34:07,395 INFO  [main ] c.h.d.DedupeMetrics Job stats: runtime = 409ms, affected rows = null
2020-05-12 12:34:07,395 INFO  [main ] c.h.d.DedupeRunner endTimestamp = 98495816381
2020-05-12 12:34:07,396 INFO  [main ] c.h.d.BigQueryHelper ### Starting job dedupe_get_duplicates_1589312047
2020-05-12 12:34:07,396 INFO  [main ] c.h.d.BigQueryHelper Query: SELECT count(*) AS num, consensusTimestamp
  FROM appy-dev-272020.dataset.transactions
  WHERE (consensusTimestamp BETWEEN 50120813526 AND 98495816381) AND (consensusTimestampTruncated BETWEEN '1970-01-01T00:00:50.120813Z' AND '1970-01-01T00:01:38.495816Z')
  GROUP BY consensusTimestamp HAVING num > 1
2020-05-12 12:34:09,593 INFO  [main ] c.h.d.DedupeMetrics Job stats: runtime = 694ms, affected rows = null
2020-05-12 12:34:09,593 INFO  [main ] c.h.d.DedupeRunner Duplicates found
2020-05-12 12:34:09,593 INFO  [main ] c.h.d.DedupeRunner consensusTimestamp, count
2020-05-12 12:34:09,593 INFO  [main ] c.h.d.DedupeRunner 69127696656, 2
2020-05-12 12:34:09,594 INFO  [main ] c.h.d.DedupeRunner 78493722568, 2
2020-05-12 12:34:09,594 INFO  [main ] c.h.d.DedupeRunner 59775008578, 2
2020-05-12 12:34:09,594 INFO  [main ] c.h.d.DedupeRunner 52192828272, 2
2020-05-12 12:34:09,594 INFO  [main ] c.h.d.DedupeRunner 65215306228, 2
2020-05-12 12:34:09,594 INFO  [main ] c.h.d.DedupeRunner 90453324013, 2
2020-05-12 12:34:09,594 INFO  [main ] c.h.d.DedupeRunner 73946728562, 2
2020-05-12 12:34:09,595 INFO  [main ] c.h.d.DedupeRunner 76710762665, 2
2020-05-12 12:34:09,595 INFO  [main ] c.h.d.DedupeRunner 80925037971, 2
2020-05-12 12:34:09,595 INFO  [main ] c.h.d.DedupeRunner 71263889391, 2
2020-05-12 12:34:09,595 INFO  [main ] c.h.d.DedupeRunner 93374108854, 2
2020-05-12 12:34:09,595 INFO  [main ] c.h.d.DedupeRunner 54670259378, 2
2020-05-12 12:34:09,595 INFO  [main ] c.h.d.DedupeRunner 68457925057, 2
2020-05-12 12:34:09,595 INFO  [main ] c.h.d.DedupeRunner 61747398494, 2
2020-05-12 12:34:09,596 INFO  [main ] c.h.d.DedupeRunner 85307360121, 2
2020-05-12 12:34:09,596 INFO  [main ] c.h.d.DedupeRunner 87903974465, 2
2020-05-12 12:34:09,596 INFO  [main ] c.h.d.DedupeRunner 83334900543, 2
2020-05-12 12:34:09,596 INFO  [main ] c.h.d.DedupeRunner 96257397717, 2
2020-05-12 12:34:09,596 INFO  [main ] c.h.d.DedupeRunner 56875693157, 2
2020-05-12 12:34:09,598 INFO  [main ] c.h.d.BigQueryHelper ### Starting job dedupe_remove_duplicates_1589312049
2020-05-12 12:34:09,598 INFO  [main ] c.h.d.BigQueryHelper Query: MERGE INTO appy-dev-272020.dataset.transactions AS dest
USING (
  SELECT k.*
  FROM (
    SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
    FROM appy-dev-272020.dataset.transactions AS original_data
    WHERE (consensusTimestamp BETWEEN 50120813526 AND 98495816381) AND (consensusTimestampTruncated BETWEEN '1970-01-01T00:00:50.120813Z' AND '1970-01-01T00:01:38.495816Z')
    GROUP BY consensusTimestamp
  )
) AS src
ON FALSE
WHEN NOT MATCHED BY SOURCE AND (dest.consensusTimestamp BETWEEN 50120813526 AND 98495816381) AND (dest.consensusTimestampTruncated BETWEEN '1970-01-01T00:00:50.120813Z' AND '1970-01-01T00:01:38.495816Z') -- remove all data in partition range
    THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
2020-05-12 12:34:14,466 INFO  [main ] c.h.d.DedupeMetrics Job stats: runtime = 3385ms, affected rows = 219
2020-05-12 12:34:14,467 INFO  [main ] c.h.d.BigQueryHelper ### Starting job dedupe_save_state_1589312054
2020-05-12 12:34:14,467 INFO  [main ] c.h.d.BigQueryHelper Query: MERGE INTO appy-dev-272020.dataset.dedupe_state
USING (
  SELECT 'lastValidTimestamp' AS name, '98495816381' AS value
)
ON FALSE
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW

2020-05-12 12:34:16,845 INFO  [main ] c.h.d.DedupeMetrics Job stats: runtime = 1385ms, affected rows = 2
2020-05-12 12:34:18,992 INFO  [main ] c.h.d.BigQueryHelper ### Starting job dedupe_flag_dml_accessible_rows_1589312058
2020-05-12 12:34:18,992 INFO  [main ] c.h.d.BigQueryHelper Query: UPDATE appy-dev-272020.dataset.transactions SET dedupe = 1 WHERE (consensusTimestamp >= 98495816382 AND consensusTimestampTruncated >= '1970-01-01T00:01:38.495816Z')
2020-05-12 12:34:23,164 INFO  [main ] c.h.d.DedupeMetrics Job stats: runtime = 2861ms, affected rows = 0
2020-05-12 12:34:23,164 INFO  [main ] c.h.d.BigQueryHelper ### Starting job dedupe_get_end_timestamp_1589312063
2020-05-12 12:34:23,164 INFO  [main ] c.h.d.BigQueryHelper Query: SELECT MAX(consensusTimestamp) AS ts
FROM appy-dev-272020.dataset.transactions
WHERE (consensusTimestamp >= 98495816382 AND consensusTimestampTruncated >= '1970-01-01T00:01:38.495816Z') AND dedupe IS NOT NULL
2020-05-12 12:34:25,021 INFO  [main ] c.h.d.DedupeMetrics Job stats: runtime = 480ms, affected rows = null
2020-05-12 12:34:25,021 INFO  [main ] c.h.d.DedupeRunner No new rows
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 60.857 s - in com.hedera.dedupe.DedupeIntegrationTest
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
[INFO]
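
The row counts in the log above appear to be self-consistent: 119 rows were inserted, 19 timestamps were reported as duplicated (each with count 2), and the `dedupe_remove_duplicates` MERGE reported 219 affected rows, which matches 119 deleted rows plus 100 re-inserted distinct rows. The following is a minimal in-memory sketch of that accounting, not the code from this PR; the class and method names (`DedupeSketch`, `countByTimestamp`, `duplicates`, `mergeAffectedRows`) are hypothetical, and it assumes the MERGE counts every deleted and every inserted row as "affected".

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only; the real job runs these steps as BigQuery queries.
final class DedupeSketch {

    // Mirrors "GROUP BY consensusTimestamp": number of rows per timestamp.
    static Map<Long, Long> countByTimestamp(List<Long> timestamps) {
        Map<Long, Long> counts = new LinkedHashMap<>();
        for (long ts : timestamps) {
            counts.merge(ts, 1L, Long::sum);
        }
        return counts;
    }

    // Mirrors "HAVING num > 1": timestamps that occur more than once.
    static List<Long> duplicates(List<Long> timestamps) {
        return countByTimestamp(timestamps).entrySet().stream()
                .filter(e -> e.getValue() > 1)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Assumed accounting for the delete-everything-then-reinsert MERGE:
    // every row in the window is deleted, one row per distinct timestamp is inserted.
    static long mergeAffectedRows(List<Long> timestamps) {
        long deleted = timestamps.size();
        long inserted = countByTimestamp(timestamps).size();
        return deleted + inserted;
    }

    public static void main(String[] args) {
        // Five rows taken from the log, two of them duplicated.
        List<Long> window = Arrays.asList(
                52192828272L, 52192828272L, 52555804049L, 54670259378L, 54670259378L);
        System.out.println(duplicates(window));        // timestamps with more than one row
        System.out.println(mergeAffectedRows(window)); // 5 deleted + 3 inserted
    }
}
```

Applied to the full test window, the same arithmetic gives 119 + (119 - 19) = 219, the figure reported by `DedupeMetrics`.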

Member

@medvedev1088 medvedev1088 left a comment

A few questions:

  • Where are you going to run it? A VM, GKE or something else?

  • What are the reasons you chose to do it with Spring rather than Cloud Composer? I guess because Cloud Composer is a bit pricy?

@apeksharma
Collaborator Author

Where are you going to run it? A VM, GKE or something else?

In a VM. DevOps provisioned an instance to run the Importer, and I'm planning to use the same one for this.

What are the reasons you chose to do it with Spring rather than Cloud Composer? I guess because Cloud Composer is a bit pricy?

I explored the following options:

  • Cloud Composer: It's great for workflow pipelines, but seemed like too big a cannon for a single periodic job with no workflows involved.

  • Scheduler + Run

But it was much easier to have a single Java process doing

while true:
  do job;
  sleep X

So went with that for v1. Can always upgrade in future.
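
For illustration, the loop above could look roughly like this in plain Java. This is a sketch under stated assumptions, not the code in this PR (which runs as a Spring application): `PeriodicJobLoop`, `runLoop`, and the `maxRuns` cap are hypothetical names, and the cap exists only so the example terminates.

```java
import java.time.Duration;

// Hypothetical sketch of "while true: do job; sleep X" with a bounded run count.
final class PeriodicJobLoop {

    // Runs the job, sleeps for the interval, and repeats up to maxRuns times.
    // Returns how many runs were attempted.
    static int runLoop(Runnable job, Duration interval, int maxRuns) {
        int attempted = 0;
        while (attempted < maxRuns) {
            try {
                job.run();
            } catch (RuntimeException e) {
                // A failed run should not kill the loop; the next cycle retries.
                System.err.println("Job failed, retrying next cycle: " + e);
            }
            attempted++;
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop looping.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return attempted;
    }

    public static void main(String[] args) {
        // The PR description mentions a 5 minute interval; a short one is used here.
        runLoop(() -> System.out.println("dedupe run"), Duration.ofMillis(10), 3);
    }
}
```

Catching runtime exceptions inside the loop is a design choice of this sketch: one failed BigQuery job then only delays deduplication until the next cycle instead of stopping the process.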

Is that okay with you?

@medvedev1088
Member

medvedev1088 commented May 18, 2020

So went with that for v1. Can always upgrade in future.

Is that okay with you?

That's ok. As you said we can upgrade later if necessary.

Make sure health checks are set up for the machine, along with disk space and memory monitoring, and logging.

@apeksharma apeksharma mentioned this pull request May 19, 2020
@apeksharma
Collaborator Author

Make sure health checks are set up for the machine, along with disk space and memory monitoring, and logging.

Absolutely. Tracking it here: #9.

@apeksharma
Collaborator Author

Hey, if you don't mind, can we please merge this in and do further changes in smaller PRs?

Signed-off-by: Apekshit Sharma <[email protected]>
@medvedev1088
Member

Hey, if you don't mind, can we please merge this in and do further changes in smaller PRs?

Sure thing.

@apeksharma apeksharma merged commit 45a6e26 into master May 19, 2020
@apeksharma apeksharma deleted the dedupe branch May 19, 2020 06:55
@apeksharma apeksharma mentioned this pull request May 23, 2020
2 tasks
apeksharma added a commit that referenced this pull request May 23, 2020
- Scripts to make dedupe deployment easier
- Fix bug in schema which resulted in the failed insert (#16)
- Added metric to dataflow job to track ingestionDelay from consensusTimestamp (#6)
- Add transaction_types table (#17)
- consistent naming of schema files
- Keep all schema files together. Move to a better location in future.

Signed-off-by: Apekshit Sharma <[email protected]>