The paragraph numbering appears to be inconsistent. I have noticed two inconsistencies so far.
Lack of numbering: The first paragraphs of AFG_07_1952.txt and AFG_17_1962.txt, for example, have no number assigned. AFG_26_1971.txt, on the other hand, numbers every paragraph.
Spacing between the number and text: Paragraph 16 of AFG_17_1962.txt seems to use spaces, instead of a tab, to separate the index from the main text.
The issues below might stem from the patterns above, but there were several challenges while cleaning the text.
# Reading in data
raw <- readRDS("UNGDC.rds")  # I compiled all .txt files into a single data frame.
processed <- raw
# Remove numbering
processed$text <- gsub("\\d+\\.\\t", "", raw$text, perl = TRUE)
I replaced every match of the pattern "digits followed by a dot and a tab" with an empty string. However, this did not remove the paragraph indices.
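One possible cause, not verified against the corpus files, is that some indices are followed by spaces rather than a tab, so the pattern above never matches them. Below is a minimal sketch of a more tolerant pattern. It assumes each index sits at the start of a line and is followed by a dot and then one or more spaces or tabs. The sample strings are made up for illustration, and `index_pattern`, `sample_text` and `multi` are hypothetical names.

```r
# Made-up sample paragraphs mimicking the patterns described above
# (these are NOT taken from the corpus).
sample_text <- c(
  "Opening paragraph without a number.",
  "15.\tParagraph separated by a tab.",
  "16.   Paragraph separated by spaces.",
  "In 1952. the session opened."
)

# Assumption: an index is digits + dot at the start of a line,
# followed by one or more spaces or tabs.
index_pattern <- "^[ \\t]*\\d+\\.[ \\t]+"

# Diagnostic: which paragraphs carry an index at all?
has_index <- grepl(index_pattern, sample_text, perl = TRUE)

# Strip the index; paragraphs without one are left unchanged.
cleaned <- sub(index_pattern, "", sample_text, perl = TRUE)

# If one string holds several paragraphs separated by newlines,
# (?m) lets ^ match at the start of every line.
multi <- "1.\tFirst.\n2.  Second."
cleaned_multi <- gsub(paste0("(?m)", index_pattern), "", multi, perl = TRUE)

print(has_index)
print(cleaned)
print(cleaned_multi)
```

Anchoring at the start of the line keeps numbers inside sentences (such as years followed by a full stop) from being removed. Whether this matches the actual files depends on how the separators were stored when the .txt files were read in.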
Jihyeonbae changed the title from "Corpus cleaning" to "Inconsistency in paragraph numbering" on Jan 9, 2024
Hello! @sjankin @reyhanehHashempour