QTM1 Code (#4)
* Pkg updates with implementation of Task 1 and 2

* Update DESCRIPTION

Authors added

* Update sample data path and add authors

* Remove R from man

* gitignore updated

* DESCRIPTION: add httr dependency

* Add functions
classificationEndpoint
correspondenceList
prefixList
retrieveClassificationTable
retrieveCorrespondenceTable
structureData

* rm library

* Update functions

* Documentation functions

* Update functions description and first version of vignettes

* rm vignettes

* Update documentation classification table and vignettes

* Update vignettes

* HTML vignettes

* Update documentation

* update vignette

* update vignette

* update vignettes

* Add CSVout parameter to both retrieve functions and update documentation and vignette

* add import httr

* remove tables

* -add ShowQUERY argument
-change function names (dataStructure and classification table)
-add columns and change order in the correspondence tables
-remove prefix duplicates
-final docs
-others

* Final version (there might be errors to check)

* vignette corrected

* quiet messages for the SPARQL query

* correct typos

* Update final changes

* Updated vignette

* add ALL as default to classificationEndpoint

* change doc for classificationEndpoint

* remove table

* Functions task 4

* Upload drafted documentation

* task 4 draft

* try function classification QC

* UPDATE

* update

* Update view is an issue

* update

* Update

Corrections for rules 7 and 8 for Classification

* Modify the function: remove the link for both CSV files in rules 7 & 8

* Modify the function: remove the link for both CSV files in rules 7 & 8

* change

* add rmd and vignette

* modification of the commentary on the different functions

* Update of the Correspondence Table function, add updates from Classification QC, and add the new function AnalyseCorrespondenceQC

* Comment out FAO because the query has changed, so it would produce error messages

* MODIFICATION

* remove because it is not needed now

* REMOVE

* Update

* remove

* Rename the variable in the explanation part

* Remove the correction for the data

* Vignette for task 5

* updated

* updated

* modification and add the vignette

* REMOVE

* Modify the comment

* add the vignette for analyseCorrespondenceTable

* Update CSVout

* Modify the reader for the csv

* Modify the input of the CSV and use the correct argument for the function

* modify the path for reading csv

* Modify the CSV output so the user can provide a path to create their own CSV file

* Update it

* Add one column and allow the user to put a path

* update

* update

* mistake

* white space

* testing

* Modify the examples: remove all reads into data frames.
Modify the commentary of the function.
Add new examples to the vignettes.

* General update

* Next version

* Modify the example to use just a path

* Create a CSV file when running the classificationQC function.

* Modify the comment

* mistake

* change it

* remove a detail

* correction

* fix issue

* push

* remove comments on example

* Update verification on the different CSV files

Fixed issue with duplicates and the 3 error messages

* remove text

* Modification of the query; the old one retrieved nothing

* Typo in the examples

* forgot the Level parameter

* fixed

* Modify the bubble

* deleted

* removed

* New version

* typo

* FORGOT TO COMMENT IT OUT

* modification of the query and added a function ("cat") to display the output clearly

* Modification of the order of the output

* changes the example for retrieveCorrespondenceTable to retrieve data

* Commented it out; it took too much time and gave an error

* rework of the checks on the CSV file for lengths, sequencing and multipleCode

And rework of the CSVout handling

* Typo

* fix

* push

* Adding data and example for the vignette ClassificationQc

* remove some weird comment

* update

* add more comments because it's confusing

* remove

* Testing QcOutput

* add gsub to obtain a clean csv file

* Adding the output for the different rules

* Testing RetrieveClassification() with other languages

* modification of the query and adaptation of the code in order to retrieve the prefix and conceptScheme

* Adding an option: if the user provides their own CSV, the function will write the output into it

* changing the example to another language

* Adding a real example with singleChildCode

* Correcting the query for FAO and R005; fixed

* adding

* Testing the vignette with another language

* Changing the format

* Modification of the safety checks for the sequencing file and modification of the sequencing part

* Add another example for singleCode

* adding tools & 9.6

* One import too many

* Example Task 6

* Remove comment

* remove whitespace

* remove

* removed

* Add task 6

* REMOVE EXAMPLE

* new version

* Modification of the sequencing argument: the user can now provide the different levels for testing the sequencing

* Adding aggregateCorrespondenceTable

* Update the description

* adding

* Adding Example for ClassificationQC

* New version with example

* remove na.omit

* modify commentary and fix the csv file

* remove view

* Issue fixed

* Adding vignette for task 6 & example with expected output

* Add the output of the vignette

* Allow the user to provide their own CSV file to be modified

* Classification QC: fix issue on singleChildMatch, sequencing and multipleCode

* finding the solution for gapbefore = 9

* Solution added for QC gapBefore

* Adding the correct output for the downloaded data

* Adding new example for Analyse CorrespondenceTable.

* the query now only displays the prefixes required for it to work

* the query now only displays the prefixes required for it to work

* the query now only displays the prefixes required for it to work

* Modification: is.null check on the second parameter

* Modification to allow adding multiple prefixes to the list

* Modification of the prefix list

* Update of the explanation

* Update of the package: modification of the queries for retrieveClassificationTable & retrieveCorrespondenceTable,
adding a new example for task 5,
modification of Classification QC

* Update README.md

adding all the functions

* Update README.md

* Update README.md

update README comment

* Update README.md

* Update of the classification QC

* Description of your modifications

---------

Co-authored-by: Martina Patone <[email protected]>
Co-authored-by: martinapatone <[email protected]>
Co-authored-by: Mészáros Mátyás <[email protected]>
4 people authored Oct 17, 2024
1 parent 6ad9bd1 commit 9d6c6e3
Showing 46 changed files with 15,801 additions and 1,028 deletions.
8 changes: 4 additions & 4 deletions DESCRIPTION
@@ -1,22 +1,22 @@
Package: correspondenceTables
Type: Package
Title: Creating Correspondence Tables Between Two Statistical Classifications
Date: 2023-04-27
Version: 0.8.2
Date: 2022-09-30
Version: 0.9.9
Authors@R: c(person("Vasilis", "Chasiotis", role = "aut", comment="Department of Statistics, Athens University of Economics and Business"),
person("Photis", "Stavropoulos", role = "aut", comment="Quantos S.A. Statistics and Information Systems"),
person("Martin", "Karlberg", role = "aut"),
person("Mátyás", "Mészáros", email = "[email protected]", role = "cre"),
person("Martina", "Patone", role = "aut"),
person("Erkand", "Muraku", role = "aut"),
person("Clement", "Thomas", role = "aut"),
person("Loic", "Bienvenue", role = "aut"))
person("Loic", "Bienvenu", role = "aut"))
Description:
A candidate correspondence table between two classifications can be created when there are correspondence tables leading from the first classification to the second one via intermediate 'pivot' classifications.
The correspondence table between two statistical classifications can be updated when one of the classifications gets updated to a new version.
License: EUPL
Encoding: UTF-8
Imports: data.table, httr, tidyverse, writexl
Imports: data.table, httr, tidyverse, writexl, stringr, igraph, tools
Suggests:
knitr,
rmarkdown,
16 changes: 8 additions & 8 deletions NAMESPACE
@@ -1,26 +1,26 @@
# Generated by roxygen2: do not edit by hand

export(aggregateCorrespondenceTable)
export(analyseCorrespondenceTable)
export(classificationEndpoint)
export(classificationQC)
export(correctionClassification)
export(correspondenceList)
export(dataStructure)
export(lengthsFile)
export(newCorrespondenceTable)
export(prefixList)
export(retrieveClassificationTable)
export(retrieveCorrespondenceTable)
export(structureData)
export(updateCorrespondenceTable)
import(httr)
import(tidyverse)
import(writexl)
import(igraph)
importFrom(data.table,fread)
importFrom(data.table,fwrite)
importFrom(httr,POST)
importFrom(httr,accept)
importFrom(httr,content)
importFrom(igraph,decompose.graph)
importFrom(igraph,graph.data.frame)
importFrom(stats,aggregate)
importFrom(stringr,str_squish)
importFrom(stringr,str_sub)
importFrom(tools,file_ext)
importFrom(utils,menu)
importFrom(utils,read.csv)
importFrom(utils,setTxtProgressBar)
325 changes: 325 additions & 0 deletions R/aggregateCorrespondenceTable.R
@@ -0,0 +1,325 @@
#' @title aggregateCorrespondenceTable aggregates correspondence tables to higher hierarchical levels
#' @description The `aggregateCorrespondenceTable` function is designed to aggregate correspondence tables between two hierarchical classifications A and B to higher hierarchical levels. This is particularly useful when correspondence information is needed at levels other than the most granular level. The function provides a 'mechanically defined' aggregation, offering users candidate aggregations for subsequent analysis by statistical classification experts.
#' @param AB a mandatory argument: the path to a CSV file containing the correspondence table between classifications A and B at the most granular level, with columns "Acode" and "Bcode".
#' @param A a mandatory argument: the path to a CSV file containing the source classification data, with columns Acode, Alevel and Asuperior.
#' @param B a mandatory argument: the path to a CSV file containing the target classification data, with columns Bcode, Blevel and Bsuperior.
#'
#' @param CSVout a character string providing the path where the aggregated correspondence table CSV file should be saved, or TRUE to save it under a default name in the working directory. If NULL, no CSV file is generated.
#'
#' @return A data frame representing the aggregated correspondence table.
#'
#' @export
#' @examples
#' # Use data from the folder extdata
#' AB <- (system.file("extdata", "ab_data.csv", package = "correspondenceTables"))
#' A <- (system.file("extdata", "a_data.csv", package = "correspondenceTables"))
#' B <- (system.file("extdata", "b_data.csv", package = "correspondenceTables"))
#'
#'
#' result <- aggregateCorrespondenceTable(AB = AB, A = A, B = B, CSVout = NULL)
#' print(result)
#'
aggregateCorrespondenceTable <- function(AB, A, B, CSVout = NULL ) {
# Check if input files exist
if (!file.exists(AB) || !file.exists(A) || !file.exists(B)) {
stop("One or more input files do not exist.")
}
# Check if the input files are csv
if (!grepl("\\.csv$", AB) || !grepl("\\.csv$", A) || !grepl("\\.csv$", B)) {
stop("One or more input files do not have a .csv extension.")
}

# Read the input correspondence table AB
ab_data <- read.csv2(AB, header = TRUE, sep =",")
ColumnNames <- colnames(ab_data)[1:2]
colnames(ab_data)[1:2] = c("Acode","Bcode")
# Check if AB table has the required columns
if (!all(c("Acode", "Bcode") %in% colnames(ab_data))) {
stop("Input correspondence table AB must have columns 'Acode' and 'Bcode'.")
}

# Check for duplicate combinations of Acode and Bcode in AB
if (any(duplicated(ab_data[c("Acode", "Bcode")]))) {
stop("Duplicate combinations of Acode and Bcode found in the input correspondence table AB.")
}

# Filter out records with missing Acode or Bcode
ab_data <- ab_data[!is.na(ab_data$Acode) & !is.na(ab_data$Bcode), ]

# Check if there are any records left
if (nrow(ab_data) == 0) {
stop("No valid records found in the input correspondence table AB.")
}

########
####Read the source classification table A
a_data <- read.csv(A, header = TRUE, sep = ",")
colnames(a_data)[1:3] = c("Acode","Alevel","Asuperior")
# Check if A table has the required columns
if (!all(c("Acode", "Alevel", "Asuperior") %in% colnames(a_data))) {
stop("Source classification table A must have columns 'Acode', 'Alevel', and 'Asuperior'.")
}

#Uniqueness 3.2 Check for duplicate Acode values in A
if (any(duplicated(a_data$Acode))) {
stop("Duplicate Acode values found in the source classification table A.")
}

# Check if Alevel is numeric
if (!all(is.numeric(a_data$Alevel))) {
stop("Alevel column in the source classification table A must contain numeric values.")
}

# Check if Asuperior is a character or blank
if (!all(is.character(a_data$Asuperior) | a_data$Alevel == 1)) {
stop("Asuperior column in the source classification table A must contain characters or be blank for records at level 1.")
}

# Initialize the variable to store the current level for A
mostGranularA <- max(a_data$Alevel)
currentLevelA <- mostGranularA

# Loop to check hierarchy at each level for A
while (currentLevelA >= 2) {
# Select rows at the current level and the level below
Ai <- a_data[a_data$Alevel == currentLevelA, ]
AiMinus1 <- a_data[a_data$Alevel == (currentLevelA - 1), ]

# Check if all values of Asuperior (in Ai) correspond to values of Acode (in AiMinus1)
error_rows <- which(!(Ai$Asuperior %in% AiMinus1$Acode))
if (length(error_rows) > 0) {
cat("Hierarchy error in A-data at level:", currentLevelA, "\n")
cat("Error occurred in rows:", error_rows, "\n")
break # Exit the loop if an error is detected
}

# Check if all values of Acode (in AiMinus1) correspond to values of Asuperior (in Ai)
error_rows <- which(!(AiMinus1$Acode %in% Ai$Asuperior))
if (length(error_rows) > 0) {
cat("Hierarchy error in A-data at level:", currentLevelA - 1, "\n")
cat("Error occurred in rows:", error_rows, "\n")
break # Exit the loop if an error is detected
}

# Move to the next level
currentLevelA <- currentLevelA - 1
}

# Read the target classification table B
b_data <- read.csv(B, header = TRUE, sep = ",")
colnames(b_data)[1:3] = c("Bcode","Blevel","Bsuperior")
# Check if B table has the required columns
if (!all(c("Bcode", "Blevel", "Bsuperior") %in% colnames(b_data))) {
stop("Target classification table B must have columns 'Bcode', 'Blevel', and 'Bsuperior'.")
}

#Uniqueness 3.2 Check for duplicate Bcode values in B
if (any(duplicated(b_data$Bcode))) {
stop("Duplicate Bcode values found in the target classification table B.")
}

# Check if Blevel is numeric in B
if (!all(is.numeric(b_data$Blevel))) {
stop("Blevel column in the target classification table B must contain numeric values.")
}

# Check if Bsuperior is a character or blank in B
if (!all(is.character(b_data$Bsuperior) | b_data$Blevel == 1)) {
stop("Bsuperior column in the target classification table B must contain characters or be blank for records at level 1.")
}

# Initialize the variable to store the current level
mostGranularB <- max(b_data$Blevel)
currentLevelB <- mostGranularB

# Loop to check hierarchy at each level for B
while (currentLevelB >= 2) {
# Select rows at the current level and the level below
Bi <- b_data[b_data$Blevel == currentLevelB, ]
BiMinus1 <- b_data[b_data$Blevel == (currentLevelB - 1), ]

# Check if all values of Bsuperior (in Bi) correspond to values of Bcode (in BiMinus1)
error_rows <- which(!(Bi$Bsuperior %in% BiMinus1$Bcode))
if (length(error_rows) > 0) {
cat("Hierarchy error in B-data at level:", currentLevelB, "\n")
cat("Error occurred in rows:", error_rows, "\n")
break # Exit the loop if an error is detected
}

# Check if all values of Bcode (in BiMinus1) correspond to values of Bsuperior (in Bi)
error_rows <- which(!(BiMinus1$Bcode %in% Bi$Bsuperior))
if (length(error_rows) > 0) {
cat("Hierarchy error in B_data at level:", currentLevelB - 1, "\n")
cat("Error occurred in rows:", error_rows, "\n")
break # Exit the loop if an error is detected
}

# Move to the next level
currentLevelB <- currentLevelB - 1
}

#Uniqueness Check if Acode and Bcode in AB exist in A and B respectively
if (!all(ab_data$Acode %in% a_data$Acode) || !all(ab_data$Bcode %in% b_data$Bcode)) {
stop("Acode or Bcode in the input correspondence table does not exist in source or target classification table.")
}

###3.4 Correct and complete correspondences

###Keep the superior column in addition to the code, since both columns are needed for the merges below
AmostGranular <- subset(a_data, Alevel == max(Alevel), select = c(Acode, Asuperior))
BmostGranular <- subset(b_data, Blevel == max(Blevel), select = c(Bcode,Bsuperior))

AB_mostGranular <- merge(AmostGranular, BmostGranular, by.x = "Acode", by.y = "Bcode")

if (!(all(ab_data$Acode %in% AmostGranular$Acode) && all(ab_data$Bcode %in% BmostGranular$Bcode))) {
stop("Acode or Bcode in the input correspondence table does not exist in source or target classification table.")
}

if (!(all(AB_mostGranular$Acode %in% ab_data$Acode) && all(AB_mostGranular$Bcode %in% ab_data$Bcode))) {
stop("Acode or Bcode in the most granular correspondence table does not exist in the input correspondence table.")
}
########## 4.1 Creation of the table and merge it.

# Create an empty list to store the levels
A_levels <- list()

# Loop through each level and subset the data
for (i in 1:mostGranularA) {
level_data <- subset(a_data, a_data$Alevel == i, select = c(Acode, Asuperior))
A_levels[[i]] <- level_data
}

# Create an empty data frame to store the final result for A
resultA <- data.frame()

# Initialize the result with the most granular level for A
resultA <- A_levels[[mostGranularA]]

# Merge the tables hierarchically starting from the second most granular level for A
for (i in (mostGranularA - 1):1) {
level_data <- A_levels[[i]]

# Merge with the result using Asuperior and Acode columns
resultA <- merge(level_data, resultA, by.x = "Acode", by.y = "Asuperior", all.x = TRUE, all.y = TRUE, suffixes = c(paste0(".x", i), paste0(".y", i)))

# Rename columns to reflect the hierarchy for A
colnames(resultA)[colnames(resultA) == paste0("Acode.x", i)] <- paste0("Acode", i)
colnames(resultA)[colnames(resultA) == paste0("Acode.y", i)] <- paste0("Acode", (i + 1))
}

# Result will contain the final aggregated correspondence table with hierarchical code columns for A
resultA$test <- resultA[[paste0("Acode", mostGranularA)]]
resultA$Asuperior <- NULL

# Determine the most granular level dynamically
mostGranularB <- max(b_data$Blevel)

# Create an empty list to store the levels
B_levels <- list()

# Loop through each level and subset the data
for (i in 1:mostGranularB) {
level_data <- subset(b_data, b_data$Blevel == i, select = c(Bcode, Bsuperior))
B_levels[[i]] <- level_data
}

# Create an empty data frame to store the final result for B
resultB <- data.frame()

# Initialize the result with the most granular level for B
resultB <- B_levels[[mostGranularB]]

# Merge the tables hierarchically starting from the second most granular level for B
for (i in (mostGranularB - 1):1) {
level_data_B <- B_levels[[i]]

# Merge with the result using Bsuperior and Bcode columns
resultB <- merge(level_data_B, resultB, by.x = "Bcode", by.y = "Bsuperior", all.x = TRUE, all.y = TRUE, suffixes = c(paste0(".x", i), paste0(".y", i)))

# Rename columns to reflect the hierarchy for B
colnames(resultB)[colnames(resultB) == paste0("Bcode.x", i)] <- paste0("Bcode", i)
colnames(resultB)[colnames(resultB) == paste0("Bcode.y", i)] <- paste0("Bcode", (i + 1))
}

# Result will contain the final aggregated correspondence table with hierarchical code columns for B
resultB$test <- resultB[[paste0("Bcode", mostGranularB)]]
resultB$Bsuperior <- NULL


# Merge resultA and resultB using the 'test' column as the key
Merged_AB <- merge(resultA, resultB, by.x = "test", by.y = "test", all = F)
Merged_AB$test <- NULL


##Table merged
final_result <-Merged_AB
###4.2 Pairwise matching

# Identify Acode and Bcode Columns
acode_columns <- grep("^Acode", colnames(final_result), value = TRUE)
bcode_columns <- grep("^Bcode", colnames(final_result), value = TRUE)

# Loop Through Acode and Bcode Columns
results_matrices <- list()

for (acode_column in acode_columns) {
level_Acode <- match(acode_column, acode_columns)

for (bcode_column in bcode_columns) {
level_Bcode <- match(bcode_column, bcode_columns)

unique_combinations <- unique(final_result[, c(acode_column, bcode_column)])

for (i in 1:nrow(unique_combinations)) {
combination <- unique_combinations[i, ]

# Perform the equality comparison using subset
subset_data <- final_result[final_result[[acode_column]] == combination[[acode_column]] &
final_result[[bcode_column]] == combination[[bcode_column]], ]

# Count Unique Occurrences
count_Acode <- sapply(acode_columns, function(col) length(unique(subset_data[[col]])))
count_Bcode <- sapply(bcode_columns, function(col) length(unique(subset_data[[col]])))

# Build Results Matrix
result_matrix <- c(level_Acode, level_Bcode, combination[[acode_column]], combination[[bcode_column]], count_Acode, count_Bcode)
results_matrices <- append(results_matrices, list(result_matrix))
}
}
}

# Convert Matrices to Dataframe
results_df <- as.data.frame(do.call(rbind, results_matrices))

# Extract levels from Acode and Bcode columns
acode_levels <- gsub("^Acode(\\d+)$", "\\1", acode_columns)
bcode_levels <- gsub("^Bcode(\\d+)$", "\\1", bcode_columns)
max_levels <- max(length(acode_levels), length(bcode_levels))
acode_levels <- rep(acode_levels, length.out = max_levels)
bcode_levels <- rep(bcode_levels, length.out = max_levels)
# Define the new column names
new_colnames <- c(paste0(ColumnNames[1]," level"),
paste0(ColumnNames[2]," level"),
ColumnNames[1],
ColumnNames[2],
paste0("N of ", ColumnNames[1], acode_levels, " level values "),
paste0("N of ", ColumnNames[2] , bcode_levels , " level values "))

# Update column names in results_df
colnames(results_df) <- new_colnames
# Display Results

if (!is.null(CSVout)) {
if (is.logical(CSVout) && CSVout == TRUE) {
file_name <- paste0("AgrgregateCorrespondeceTable_", ColumnNames[1], " & ",ColumnNames[2], ".csv")
path_file <- file.path(getwd(), file_name)
write.csv(results_df, path_file, row.names = FALSE)
message(paste0("The table was saved in ", getwd(), file_name))
} else if (is.character(CSVout)) {
write.csv(results_df, CSVout, row.names = FALSE)
}
}
return(results_df)

}
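
For reference, a minimal usage sketch of the new aggregateCorrespondenceTable function and its CSVout argument, reusing the extdata sample files from the roxygen example above; the output file names and tempdir() path below are illustrative only, not part of the package:

library(correspondenceTables)

# Sample files shipped with the package (same as in the roxygen example)
AB <- system.file("extdata", "ab_data.csv", package = "correspondenceTables")
A  <- system.file("extdata", "a_data.csv",  package = "correspondenceTables")
B  <- system.file("extdata", "b_data.csv",  package = "correspondenceTables")

# CSVout = NULL: return the aggregated table only, no file is written
result <- aggregateCorrespondenceTable(AB = AB, A = A, B = B, CSVout = NULL)
print(result)

# CSVout = TRUE: also write the table under a default name in the working directory
# result <- aggregateCorrespondenceTable(AB, A, B, CSVout = TRUE)

# CSVout = "<path>": write the table to a user-supplied CSV path (illustrative path)
# result <- aggregateCorrespondenceTable(AB, A, B, CSVout = file.path(tempdir(), "aggregated_AB.csv"))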

