quick_merge

A utility to merge(full outer join or inner join) large number of files in an optimized manner.The algorithm capitalizes on the fact that full outer join is associative as well as commutative, so the order of merging would not change the final result.

Let's say we have 100 files (f1, f2, f3, .... f100) all of them indexed by id, all of which needs to be merged together to form one consolidate file. Insetead of merging them one after the other in a loop, the quick_merge follows these steps.

Add all the files f1, f2, f3, ... f100 to a heap(the datastructure).
Balance the heap, such that the top contains the one with minimum rowsize.
Pop the top 2 elements from the heap, merge(full outer join) them and push back again to the heap.
Repeat the process until we have just one file in the heap, which will be the final merged file.
Return the final merged file from heap.

Theoretically a full outer merge costs O(n*m), where n and m are number of rows in the files (without indexing). If we index, the same complexity reduces to O(n+m). So, if we have two small files in each step to merge, we save time.

⚠️ : The passed `id` should be unique for each row in a dataframe, else the result would depend on the order of merge.

Installation

Build from source.

git clone https://github.com/farziengineer/quick_merge.git
cd quick_merge
python setup.py install

Usage

import pandas as pd

from quick_merge import quick_merge_

df_1 = pd.DataFrame(data={"A" : [1, 2, 3, 4], "B" : [43, 54, 67, 23], "C" : [43, 56, 89, 12]})
df_2 = pd.DataFrame(data={"A" : [4, 5, 1, 6], "D" : [334, 34, 56, 12]})
df_3 = pd.DataFrame(data={"A" : [12, 1, 3, 4], "E" : [12, 54, 56, 67], "F" : [34, 56, 34, 12]})
df_1
   A   B   C
0  1  43  43
1  2  54  56
2  3  67  89
3  4  23  12

df_2

   A    D
0  4  334
1  5   34
2  1   56
3  6   12


df_3
    A   E   F
0  12  12  34
1   1  54  56
2   3  56  34
3   4  67  12


dfs = [df_1, df_2, df_3]

merged_df = quick_merge_(dataframes=dfs, merge_key="A")

merged_df

       B     C      D     E     F
A                                
1   43.0  43.0   56.0  54.0  56.0
2   54.0  56.0    NaN   NaN   NaN
3   67.0  89.0    NaN  56.0  34.0
4   23.0  12.0  334.0  67.0  12.0
5    NaN   NaN   34.0   NaN   NaN
6    NaN   NaN   12.0   NaN   NaN
12   NaN   NaN    NaN  12.0  34.0

Name		Name	Last commit message	Last commit date
Latest commit History 53 Commits
quick_merge		quick_merge
tests		tests
.gitignore		.gitignore
.travis.yml		.travis.yml
README.md		README.md
__init__.py		__init__.py
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

quick_merge

Installation

Usage

About

Releases

Packages

Languages

mohitanand001/quick_merge

Folders and files

Latest commit

History

Repository files navigation

quick_merge

Installation

Usage

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages