Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plan migration from Sourcify DB v0 to v1: custom script vs direct SQL script #1635

Closed
marcocastignoli opened this issue Sep 17, 2024 · 7 comments
Assignees

Comments

@marcocastignoli
Copy link
Member

marcocastignoli commented Sep 17, 2024

With the new VerA schema (verifier-alliance/database-specs#15), we need to migrate existing compiled_contracts.sources to the new tables sources and compiled_contracts_sources. Ensuring data integrity and minimizing downtime are critical for a smooth transition.

Evaluate whether to develop a custom js script or to execute a direct migration using an SQL script.

@marcocastignoli marcocastignoli self-assigned this Sep 17, 2024
@marcocastignoli
Copy link
Member Author

marcocastignoli commented Sep 17, 2024

With this SQL script we migrate from compiled_contracts.sources to the new tables sources and compiled_contracts_sources

Some considerations:

  • This script uses CTEs: Common Table Expressions are temporary result sets in SQL that you can reference within a single SQL statement. They are defined using the WITH clause and can simplify complex queries, improve readability, and allow for recursive operations.
  • Unfortunately we cannot use keccak256 since it's not supported on GCP.
  • Executing the query below will migrate only the first 100 rows
  • The query below doesn't include removing the compiled_contracts.sources column
-- Begin the transaction
BEGIN;

-- Step 1: Extract and insert unique sources into 'sources' table
WITH extracted_sources AS (
    SELECT
        cc.id AS compilation_id,
        src.key AS path,
        src.value #>> '{}' AS content, -- Extracts the text without quotes
        digest(src.value #>> '{}' , 'sha256') AS source_hash,
        ''::bytea AS source_hash_keccak -- Ensure keccak256 function exists
    FROM
        compiled_contracts cc
    CROSS JOIN LATERAL
        jsonb_each(cc.sources::jsonb) AS src(key, value)
    ORDER BY cc.id  -- delete this line for the full migration
    LIMIT 100  -- delete this line for the full migration
),
inserted_sources AS (
    INSERT INTO sources (
        source_hash,
        source_hash_keccak,
        content
    )
    SELECT DISTINCT
        es.source_hash,
        es.source_hash_keccak,
        es.content
    FROM
        extracted_sources es
    ON CONFLICT (source_hash) DO NOTHING
    RETURNING source_hash
),
inserted_links AS (
    INSERT INTO compiled_contracts_sources (
        compilation_id,
        source_hash,
        path
    )
    SELECT
        es.compilation_id,
        es.source_hash,
        es.path
    FROM
        extracted_sources es
    ON CONFLICT (compilation_id, path) DO NOTHING
    RETURNING id
)
-- Main statement to execute the CTEs and retrieve summary counts
SELECT
    (SELECT COUNT(*) FROM inserted_sources) AS sources_inserted,
    (SELECT COUNT(*) FROM inserted_links) AS links_inserted;

-- Commit the transaction
COMMIT;

@marcocastignoli marcocastignoli changed the title Think about how we can move the data on Sourcify DB v0 schema to Sourcify DB v1: Do we need to run the re-verification or can we run a script to do directly with SQL queries? Plan migration from Sourcify DB v0 to v1: custom script vs direct SQL script Sep 17, 2024
@marcocastignoli
Copy link
Member Author

marcocastignoli commented Sep 17, 2024

On my machine it took 5 minutes to migrate 383'160 compiled_contracts to 148'865 sources and 2'659'011 compiled_contracts_sources. For comparison on the production Sourcify database we have 835'290 compiled_contracts.

@ethereum ethereum deleted a comment Sep 18, 2024
@marcocastignoli
Copy link
Member Author

Below are the steps of a possible migration from v0 to v1, using the query above, minimizing downtime, and without the need to maintain backward compatibility with v0.

  1. Prepare staging for schema v1 (with no backward compatibility to v0): Add support for the new tables sources and compiled_contracts_sources #1615
  2. When we are ready to deploy, start the migration query on production and migrate all compiled_contracts sources until TIMESTAMP_START_MIGRATION
  3. Wait for the migration to finish
  4. Shutdown Sourcify
  5. Start migrating missing contracts and deploy
    5.1. Start migrating compiled_contracts starting from TIMESTAMP_START_MIGRATION to TIMESTAMP_SHUTDOWN_SOURCIFY
    5.2. Meanwhile we can deploy the new sourcify version

Possible optimization:

  • In order to reduce the load on the sql instance we could prepare a small script to run the sql in batches

@kuzdogan
Copy link
Member

From what I see you are not deleting the compiled_contracts.sources, right?

Also not clear to me why you'd shutdown Sourcify shouldn't it be like this?

  1. Execute query to copy to new sources tables
  2. Merge the PR that works with new tables (like /files endpoints should use sources instead of compiled_contracts.sources). Also with that, Sourcify starts writing to the new tables
  3. Execute the query to copy the rows missed since the last execution
  4. Delete compiled_contracts.sources column (possible?)

@marcocastignoli
Copy link
Member Author

From what I see you are not deleting the compiled_contracts.sources, right?

Right, I didn't include this in the process: I think we can keep the table for some time just to be sure that everything is correct.

Also not clear to me why you'd shutdown Sourcify shouldn't it be like this?

Also right. I included that step to be clear about the timeframe of the second migration. It can be from TIMESTAMP_START_MIGRATION to TIMESTAMP_RELEASE

@marcocastignoli
Copy link
Member Author

Migration successfully executed on staging! Below you can find a list of steps / problems / fixes we encountered during this process.

Migration Process Fixes


Error during SQL script execution:

  • Reason:
    SQL Error [53400]: ERROR: temporary file size exceeds temp_file_limit (1537530kB)

  • Solution
    Set the flag temp_file_limit to the maximum on GCP's postgresql instance


Environment Weakness:

  • We realized the environments have low CPU / RAM so running the script is a bit slower.

Staging Execution:

  • sources_inserted | links_inserted
    • 158146 | 685767
  • It took ~27 minutes

PR Merge:

  • Successfully merged the PR to staging.

Schema Adjustment:

  • We had to set compiled_contracts.sources as nullable, otherwise we cannot keep the compiled_contracts.sources column.

Issue Found:

  • Reason:
    We were using keccak256 to find existing sources in the database. However, we should be using sha256.
  • Solution:
    Created this PR to address the hashing issue.

Checked if verification works:

  • Checked whether new contracts have their corresponding sources in the new table.

Migration Re-execution:

  • sources_inserted | links_inserted
    • 19 | 106

@marcocastignoli
Copy link
Member Author

Let's keep this as the issue to track the planning (and execution on staging) and use #1672 to track the execution on the production environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

No branches or pull requests

2 participants