QTM1 Code (#4)
* Pkg updates with implementation of Task 1 and 2

* Update DESCRIPTION

Authors added

* Update sample data path and add authors

* Remove R from man

* gitignore updated

* DESCRIPTION: add httr dependency

* Add functions
classificationEndpoint
correspondenceList
prefixList
retrieveClassificationTable
retrieveCorrespondenceTable
structureData

* rm library

* Update functions

* Documentation functions

* Update functions description and first version of vignettes

* rm vignettes

* Update documentation classification table and vignettes

* Update vignettes

* HTML vignettes

* Update documentation

* update vignette

* update vignette

* update vignettes

* Add CSVout parameter to both retrieve functions and update documentation and vignette

* add import httr

* remove tables

* -add ShowQUERY argument
-change function names (dataStructure and classification table)
-add columns and change order in the correspondence tables
-remove prefix duplicates
-final docs
-others

* Final version (there might be errors to check)

* vignette corrected

* quiet messages for the SPARQL query

* correct typos

* Update final changes

* Updated vignette

* add ALL as default to classificationEndpoint

* change doc for classificationEndpoint

* remove table

* Functions task 4

* Upload drafted documentation

* task 4 draft

* try function classification QC

* UPDATE

* update

* Update view is an issue

* update

* Update

Corrections for rules 7 and 8 for Classification

* Modify the function: remove the link for both CSV files in rules 7 & 8

* Modify the function: remove the link for both CSV files in rules 7 & 8

* change

* add rmd and vignette

* modification of the commentary on the different functions

* Update of the Correspondence Table function, add updates from Classification QC, and add the new function AnalyseCorrespondenceQC

* Comment out FAO because the query has changed, so it would produce error messages

* MODIFICATION

* remove because it is not needed now

* REMOVE

* Update

* remove

* Rename the variable in the explanation part

* Remove the correction for the data

* Vignette for task 5

* updated

* updated

* modification and add the vignette

* REMOVE

* Modify the comment

* add the vignette for analyseCorrespondenceTable

* Update CSVout

* Modify the reader for the csv

* Modify the input of the CSV and use the correct argument for the function

* modify the path for reading csv

* Modify the CSV output so the user can provide a path to create their own CSV file

* Update it

* Add one column and allow the user to put a path

* update

* update

* mistake

* white space

* testing

* Modify the examples: remove all reads into data frames.
Modify the commentary of the function.
Add new examples to the vignettes.

* General update

* Next version

* Modify the example to use just a path

* Create a CSV file when running the classificationQC function.

* Modify the comment

* mistake

* change it

* remove a detail

* correction

* fix issue

* push

* remove comments on example

* Update verification on the different CSV files

Fixed issue with duplicates and the 3 error messages

* remove text

* Modification of the query; the old one retrieved nothing

* Typo in the examples

* forgot the Level parameter

* fixed

* Modify the bubble

* deleted

* removed

* New version

* typo

* FORGOT TO COMMENT IT OUT

* modification of the query and added a function ("cat") to display the output clearly

* Modification of the order of the output

* changes the example for retrieveCorrespondenceTable to retrieve data

* Commented it out; it took too much time and gave an error

* rework of the checks on the CSV file for lengths, sequencing and multipleCode

And rework of the CSVout handling

* Typo

* fix

* push

* Adding data and example for the vignette ClassificationQc

* remove some weird comment

* update

* add more comments because it's confusing

* remove

* Testing QcOutput

* add gsub to obtain a clean csv file

* Adding the output for the different rules

* Testing RetrieveClassification() with other languages

* modification of the query and adaptation of the code in order to retrieve the prefix and conceptScheme

* Adding an option: if the user provides their own CSV, the function will write the output into it

* changing the example to another language

* Adding a real example with singleChildCode

* Correcting the query for FAO and R005; fixed

* adding

* Testing the vignette with another language

* Changing the format

* Modification of the safety checks for the sequencing file and modification of the sequencing part

* Add another example for singleCode

* adding tools & 9.6

* One import too many

* Example Task 6

* Remove comment

* remove whitespace

* remove

* removed

* Add task 6

* REMOVE EXAMPLE

* new version

* Modification of the sequencing argument: the user can now provide the different levels for testing the sequencing

* Adding aggregateCorrespondenceTable

* Update the description

* adding

* Adding Example for ClassificationQC

* New version with example

* remove na.omit

* modify commentary and fix the csv file

* remove view

* Issue fixed

* Adding vignette for task 6 & example with expected output

* Add the output of the vignette

* Allow the user to provide their own CSV file to be modified

* Classification QC: fix issue on singleChildMatch, sequencing and multipleCode

* finding the solution for gapbefore = 9

* Solution added for QC gapBefore

* Adding the correct output for the downloaded data

* Adding new example for Analyse CorrespondenceTable.

* the query now only displays the prefixes required for it to work

* the query now only displays the prefixes required for it to work

* the query now only displays the prefixes required for it to work

* Modification: is.null check on the second parameter

* Modification to allow adding multiple prefixes to the list

* Modification of the prefix list

* Update of the explanation

* Update of the package: modification of the queries for retrieveClassificationTable & retrieveCorrespondenceTable,
adding a new example for task 5,
modification of Classification QC

* Update README.md

adding all the functions

* Update README.md

* Update README.md

update README comment

* Update README.md

* Update of the classification QC

* Description of your modifications

---------

Co-authored-by: Martina Patone <[email protected]>
Co-authored-by: martinapatone <[email protected]>
Co-authored-by: Mészáros Mátyás <[email protected]>
4 people authored Oct 17, 2024
1 parent 6ad9bd1 commit 9d6c6e3
Showing 46 changed files with 15,801 additions and 1,028 deletions.
8 changes: 4 additions & 4 deletions DESCRIPTION
@@ -1,22 +1,22 @@
Package: correspondenceTables
Type: Package
Title: Creating Correspondence Tables Between Two Statistical Classifications
Date: 2023-04-27
Version: 0.8.2
Date: 2022-09-30
Version: 0.9.9
Authors@R: c(person("Vasilis", "Chasiotis", role = "aut", comment="Department of Statistics, Athens University of Economics and Business"),
person("Photis", "Stavropoulos", role = "aut", comment="Quantos S.A. Statistics and Information Systems"),
person("Martin", "Karlberg", role = "aut"),
person("Mátyás", "Mészáros", email = "[email protected]", role = "cre"),
person("Martina", "Patone", role = "aut"),
person("Erkand", "Muraku", role = "aut"),
person("Clement", "Thomas", role = "aut"),
person("Loic", "Bienvenue", role = "aut"))
person("Loic", "Bienvenu", role = "aut"))
Description:
A candidate correspondence table between two classifications can be created when there are correspondence tables leading from the first classification to the second one via intermediate 'pivot' classifications.
The correspondence table between two statistical classifications can be updated when one of the classifications gets updated to a new version.
License: EUPL
Encoding: UTF-8
Imports: data.table, httr, tidyverse, writexl
Imports: data.table, httr, tidyverse, writexl, stringr, igraph, tools
Suggests:
knitr,
rmarkdown,
16 changes: 8 additions & 8 deletions NAMESPACE
@@ -1,26 +1,26 @@
# Generated by roxygen2: do not edit by hand

export(aggregateCorrespondenceTable)
export(analyseCorrespondenceTable)
export(classificationEndpoint)
export(classificationQC)
export(correctionClassification)
export(correspondenceList)
export(dataStructure)
export(lengthsFile)
export(newCorrespondenceTable)
export(prefixList)
export(retrieveClassificationTable)
export(retrieveCorrespondenceTable)
export(structureData)
export(updateCorrespondenceTable)
import(httr)
import(tidyverse)
import(writexl)
import(igraph)
importFrom(data.table,fread)
importFrom(data.table,fwrite)
importFrom(httr,POST)
importFrom(httr,accept)
importFrom(httr,content)
importFrom(igraph,decompose.graph)
importFrom(igraph,graph.data.frame)
importFrom(stats,aggregate)
importFrom(stringr,str_squish)
importFrom(stringr,str_sub)
importFrom(tools,file_ext)
importFrom(utils,menu)
importFrom(utils,read.csv)
importFrom(utils,setTxtProgressBar)
325 changes: 325 additions & 0 deletions R/aggregateCorrespondenceTable.R
@@ -0,0 +1,325 @@
#' @title aggregateCorrespondenceTable aggregates correspondence tables to higher hierarchical levels
#' @description The `aggregateCorrespondenceTable` function is designed to aggregate correspondence tables between two hierarchical classifications A and B to higher hierarchical levels. This is particularly useful when correspondence information is needed at levels other than the most granular level. The function provides a 'mechanically defined' aggregation, offering users candidate aggregations for subsequent analysis by statistical classification experts.
#' @param AB a mandatory argument: the path to a CSV file containing the correspondence table between classifications A and B at the most granular level, with columns "Acode" and "Bcode".
#' @param A a mandatory argument: the path to a CSV file containing the source classification data, with columns Acode, Alevel and Asuperior.
#' @param B a mandatory argument: the path to a CSV file containing the target classification data, with columns Bcode, Blevel and Bsuperior.
#'
#' @param CSVout a character string providing the path where the aggregated correspondence table CSV file should be saved, or TRUE to save it under a default name in the working directory. If NULL, no CSV file is generated.
#'
#' @return A data frame representing the aggregated correspondence table.
#'
#' @export
#' @examples
#' # Use data from the folder extdata
#' AB <- (system.file("extdata", "ab_data.csv", package = "correspondenceTables"))
#' A <- (system.file("extdata", "a_data.csv", package = "correspondenceTables"))
#' B <- (system.file("extdata", "b_data.csv", package = "correspondenceTables"))
#'
#'
#' result <- aggregateCorrespondenceTable(AB = AB, A = A, B = B, CSVout = NULL)
#' print(result)
#'
aggregateCorrespondenceTable <- function(AB, A, B, CSVout = NULL ) {
# Check if input files exist
if (!file.exists(AB) || !file.exists(A) || !file.exists(B)) {
stop("One or more input files do not exist.")
}
# Check if the input files are csv
if (!grepl("\\.csv$", AB) || !grepl("\\.csv$", A) || !grepl("\\.csv$", B)) {
stop("One or more input files do not have a .csv extension.")
}

# Read the input correspondence table AB
ab_data <- read.csv2(AB, header = TRUE, sep =",")
ColumnNames <- colnames(ab_data)[1:2]
colnames(ab_data)[1:2] = c("Acode","Bcode")
# Check if AB table has the required columns
if (!all(c("Acode", "Bcode") %in% colnames(ab_data))) {
stop("Input correspondence table AB must have columns 'Acode' and 'Bcode'.")
}

# Check for duplicate combinations of Acode and Bcode in AB
if (any(duplicated(ab_data[c("Acode", "Bcode")]))) {
stop("Duplicate combinations of Acode and Bcode found in the input correspondence table AB.")
}

# Filter out records with missing Acode or Bcode
ab_data <- ab_data[!is.na(ab_data$Acode) & !is.na(ab_data$Bcode), ]

# Check if there are any records left
if (nrow(ab_data) == 0) {
stop("No valid records found in the input correspondence table AB.")
}

########
####Read the source classification table A
a_data <- read.csv(A, header = TRUE, sep = ",")
colnames(a_data)[1:3] = c("Acode","Alevel","Asuperior")
# Check if A table has the required columns
if (!all(c("Acode", "Alevel", "Asuperior") %in% colnames(a_data))) {
stop("Source classification table A must have columns 'Acode', 'Alevel', and 'Asuperior'.")
}

#Uniqueness 3.2 Check for duplicate Acode values in A
if (any(duplicated(a_data$Acode))) {
stop("Duplicate Acode values found in the source classification table A.")
}

# Check if Alevel is numeric
if (!all(is.numeric(a_data$Alevel))) {
stop("Alevel column in the source classification table A must contain numeric values.")
}

# Check if Asuperior is a character or blank
if (!all(is.character(a_data$Asuperior) | a_data$Alevel == 1)) {
stop("Asuperior column in the source classification table A must contain characters or be blank for records at level 1.")
}

# Initialize the variable to store the current level for A
mostGranularA <- max(a_data$Alevel)
currentLevelA <- mostGranularA

# Loop to check hierarchy at each level for A
while (currentLevelA >= 2) {
# Select rows at the current level and the level below
Ai <- a_data[a_data$Alevel == currentLevelA, ]
AiMinus1 <- a_data[a_data$Alevel == (currentLevelA - 1), ]

# Check if all values of Asuperior (in Ai) correspond to values of Acode (in AiMinus1)
error_rows <- which(!(Ai$Asuperior %in% AiMinus1$Acode))
if (length(error_rows) > 0) {
cat("Hierarchy error in A-data at level:", currentLevelA, "\n")
cat("Error occurred in rows:", error_rows, "\n")
break # Exit the loop if an error is detected
}

# Check if all values of Acode (in AiMinus1) correspond to values of Asuperior (in Ai)
error_rows <- which(!(AiMinus1$Acode %in% Ai$Asuperior))
if (length(error_rows) > 0) {
cat("Hierarchy error in A-data at level:", currentLevelA - 1, "\n")
cat("Error occurred in rows:", error_rows, "\n")
break # Exit the loop if an error is detected
}

# Move to the next level
currentLevelA <- currentLevelA - 1
}

# Read the target classification table B
b_data <- read.csv(B, header = TRUE, sep = ",")
colnames(b_data)[1:3] = c("Bcode","Blevel","Bsuperior")
# Check if B table has the required columns
if (!all(c("Bcode", "Blevel", "Bsuperior") %in% colnames(b_data))) {
stop("Target classification table B must have columns 'Bcode', 'Blevel', and 'Bsuperior'.")
}

#Uniqueness 3.2 Check for duplicate Bcode values in B
if (any(duplicated(b_data$Bcode))) {
stop("Duplicate Bcode values found in the target classification table B.")
}

# Check if Blevel is numeric in B
if (!all(is.numeric(b_data$Blevel))) {
stop("Blevel column in the target classification table B must contain numeric values.")
}

# Check if Bsuperior is a character or blank in B
if (!all(is.character(b_data$Bsuperior) | b_data$Blevel == 1)) {
stop("Bsuperior column in the target classification table B must contain characters or be blank for records at level 1.")
}

# Initialize the variable to store the current level
mostGranularB <- max(b_data$Blevel)
currentLevelB <- mostGranularB

# Loop to check hierarchy at each level for B
while (currentLevelB >= 2) {
# Select rows at the current level and the level below
Bi <- b_data[b_data$Blevel == currentLevelB, ]
BiMinus1 <- b_data[b_data$Blevel == (currentLevelB - 1), ]

# Check if all values of Bsuperior (in Bi) correspond to values of Bcode (in BiMinus1)
error_rows <- which(!(Bi$Bsuperior %in% BiMinus1$Bcode))
if (length(error_rows) > 0) {
cat("Hierarchy error in B-data at level:", currentLevelB, "\n")
cat("Error occurred in rows:", error_rows, "\n")
break # Exit the loop if an error is detected
}

# Check if all values of Bcode (in BiMinus1) correspond to values of Bsuperior (in Bi)
error_rows <- which(!(BiMinus1$Bcode %in% Bi$Bsuperior))
if (length(error_rows) > 0) {
cat("Hierarchy error in B_data at level:", currentLevelB - 1, "\n")
cat("Error occurred in rows:", error_rows, "\n")
break # Exit the loop if an error is detected
}

# Move to the next level
currentLevelB <- currentLevelB - 1
}

#Uniqueness Check if Acode and Bcode in AB exist in A and B respectively
if (!all(ab_data$Acode %in% a_data$Acode) || !all(ab_data$Bcode %in% b_data$Bcode)) {
stop("Acode or Bcode in the input correspondence table does not exist in source or target classification table.")
}

###3.4 Correct and complete correspondences

###Keep the superior column in addition to the code, since both columns are needed for the merges below
AmostGranular <- subset(a_data, Alevel == max(Alevel), select = c(Acode, Asuperior))
BmostGranular <- subset(b_data, Blevel == max(Blevel), select = c(Bcode,Bsuperior))

AB_mostGranular <- merge(AmostGranular, BmostGranular, by.x = "Acode", by.y = "Bcode")

if (!(all(ab_data$Acode %in% AmostGranular$Acode) && all(ab_data$Bcode %in% BmostGranular$Bcode))) {
stop("Acode or Bcode in the input correspondence table does not exist in source or target classification table.")
}

if (!(all(AB_mostGranular$Acode %in% ab_data$Acode) && all(AB_mostGranular$Bcode %in% ab_data$Bcode))) {
stop("Acode or Bcode in the most granular correspondence table does not exist in the input correspondence table.")
}
########## 4.1 Creation of the table and merge it.

# Create an empty list to store the levels
A_levels <- list()

# Loop through each level and subset the data
for (i in 1:mostGranularA) {
level_data <- subset(a_data, a_data$Alevel == i, select = c(Acode, Asuperior))
A_levels[[i]] <- level_data
}

# Create an empty data frame to store the final result for A
resultA <- data.frame()

# Initialize the result with the most granular level for A
resultA <- A_levels[[mostGranularA]]

# Merge the tables hierarchically starting from the second most granular level for A
for (i in (mostGranularA - 1):1) {
level_data <- A_levels[[i]]

# Merge with the result using Asuperior and Acode columns
resultA <- merge(level_data, resultA, by.x = "Acode", by.y = "Asuperior", all.x = TRUE, all.y = TRUE, suffixes = c(paste0(".x", i), paste0(".y", i)))

# Rename columns to reflect the hierarchy for A
colnames(resultA)[colnames(resultA) == paste0("Acode.x", i)] <- paste0("Acode", i)
colnames(resultA)[colnames(resultA) == paste0("Acode.y", i)] <- paste0("Acode", (i + 1))
}

# Result will contain the final aggregated correspondence table with hierarchical code columns for A
resultA$test <- resultA[[paste0("Acode", mostGranularA)]]
resultA$Asuperior <- NULL

# Determine the most granular level dynamically
mostGranularB <- max(b_data$Blevel)

# Create an empty list to store the levels
B_levels <- list()

# Loop through each level and subset the data
for (i in 1:mostGranularB) {
level_data <- subset(b_data, b_data$Blevel == i, select = c(Bcode, Bsuperior))
B_levels[[i]] <- level_data
}

# Create an empty data frame to store the final result for B
resultB <- data.frame()

# Initialize the result with the most granular level for B
resultB <- B_levels[[mostGranularB]]

# Merge the tables hierarchically starting from the second most granular level for B
for (i in (mostGranularB - 1):1) {
level_data_B <- B_levels[[i]]

# Merge with the result using Bsuperior and Bcode columns
resultB <- merge(level_data_B, resultB, by.x = "Bcode", by.y = "Bsuperior", all.x = TRUE, all.y = TRUE, suffixes = c(paste0(".x", i), paste0(".y", i)))

# Rename columns to reflect the hierarchy for B
colnames(resultB)[colnames(resultB) == paste0("Bcode.x", i)] <- paste0("Bcode", i)
colnames(resultB)[colnames(resultB) == paste0("Bcode.y", i)] <- paste0("Bcode", (i + 1))
}

# Result will contain the final aggregated correspondence table with hierarchical code columns for B
resultB$test <- resultB[[paste0("Bcode", mostGranularB)]]
resultB$Bsuperior <- NULL


# Merge resultA and resultB using the 'test' column as the key
Merged_AB <- merge(resultA, resultB, by.x = "test", by.y = "test", all = F)
Merged_AB$test <- NULL


##Table merged
final_result <-Merged_AB
###4.2 Pairwise matching

# Identify Acode and Bcode Columns
acode_columns <- grep("^Acode", colnames(final_result), value = TRUE)
bcode_columns <- grep("^Bcode", colnames(final_result), value = TRUE)

# Loop Through Acode and Bcode Columns
results_matrices <- list()

for (acode_column in acode_columns) {
level_Acode <- match(acode_column, acode_columns)

for (bcode_column in bcode_columns) {
level_Bcode <- match(bcode_column, bcode_columns)

unique_combinations <- unique(final_result[, c(acode_column, bcode_column)])

for (i in 1:nrow(unique_combinations)) {
combination <- unique_combinations[i, ]

# Perform the equality comparison using subset
subset_data <- final_result[final_result[[acode_column]] == combination[[acode_column]] &
final_result[[bcode_column]] == combination[[bcode_column]], ]

# Count Unique Occurrences
count_Acode <- sapply(acode_columns, function(col) length(unique(subset_data[[col]])))
count_Bcode <- sapply(bcode_columns, function(col) length(unique(subset_data[[col]])))

# Build Results Matrix
result_matrix <- c(level_Acode, level_Bcode, combination[[acode_column]], combination[[bcode_column]], count_Acode, count_Bcode)
results_matrices <- append(results_matrices, list(result_matrix))
}
}
}

# Convert Matrices to Dataframe
results_df <- as.data.frame(do.call(rbind, results_matrices))

# Extract levels from Acode and Bcode columns
acode_levels <- gsub("^Acode(\\d+)$", "\\1", acode_columns)
bcode_levels <- gsub("^Bcode(\\d+)$", "\\1", bcode_columns)
max_levels <- max(length(acode_levels), length(bcode_levels))
acode_levels <- rep(acode_levels, length.out = max_levels)
bcode_levels <- rep(bcode_levels, length.out = max_levels)
# Define the new column names
new_colnames <- c(paste0(ColumnNames[1]," level"),
paste0(ColumnNames[2]," level"),
ColumnNames[1],
ColumnNames[2],
paste0("N of ", ColumnNames[1], acode_levels, " level values "),
paste0("N of ", ColumnNames[2] , bcode_levels , " level values "))

# Update column names in results_df
colnames(results_df) <- new_colnames
# Display Results

if (!is.null(CSVout)) {
if (is.logical(CSVout) && CSVout == TRUE) {
file_name <- paste0("AgrgregateCorrespondeceTable_", ColumnNames[1], " & ",ColumnNames[2], ".csv")
path_file <- file.path(getwd(), file_name)
write.csv(results_df, path_file, row.names = FALSE)
message(paste0("The table was saved in ", getwd(), file_name))
} else if (is.character(CSVout)) {
write.csv(results_df, CSVout, row.names = FALSE)
}
}
return(results_df)

}
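
For reference, a minimal usage sketch of the new aggregateCorrespondenceTable function and its CSVout argument, reusing the extdata sample files from the roxygen example above; the output file names and tempdir() path below are illustrative only, not part of the package:

library(correspondenceTables)

# Sample files shipped with the package (same as in the roxygen example)
AB <- system.file("extdata", "ab_data.csv", package = "correspondenceTables")
A  <- system.file("extdata", "a_data.csv",  package = "correspondenceTables")
B  <- system.file("extdata", "b_data.csv",  package = "correspondenceTables")

# CSVout = NULL: return the aggregated table only, no file is written
result <- aggregateCorrespondenceTable(AB = AB, A = A, B = B, CSVout = NULL)
print(result)

# CSVout = TRUE: also write the table under a default name in the working directory
# result <- aggregateCorrespondenceTable(AB, A, B, CSVout = TRUE)

# CSVout = "<path>": write the table to a user-supplied CSV path (illustrative path)
# result <- aggregateCorrespondenceTable(AB, A, B, CSVout = file.path(tempdir(), "aggregated_AB.csv"))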

