Inconsistency in paragraph numbering #6

Open
Jihyeonbae opened this issue Jan 9, 2024 · 0 comments

Hello! @sjankin @reyhanehHashempour

The paragraph numbering seems to be inconsistent. I've noticed two inconsistencies so far.

  1. Lack of numbering: The first paragraphs of AFG_07_1952.txt and AFG_17_1962.txt, for example, have no number assigned. AFG_26_1971.txt, on the other hand, numbers every paragraph.
  2. Spacing between the number and the text: Paragraph 16 of AFG_17_1962.txt seems to use spaces, rather than a tab, to separate the index from the main text.
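
To make the two patterns concrete, here is a minimal R sketch that labels each paragraph by how its index is separated from the text. The sample strings and the helper name `classify_numbering` are made up for illustration and are not taken from the corpus files.

```r
# Hypothetical sample paragraphs (not copied from the corpus).
paragraphs <- c(
  "Mr. President, I congratulate you.",    # no number at all
  "2.\tThe second paragraph uses a tab.",  # number, dot, tab
  "16.   This paragraph uses spaces."      # number, dot, spaces
)

# Hypothetical helper: label each paragraph by its numbering style.
classify_numbering <- function(x) {
  ifelse(grepl("^\\d+\\.\\t", x, perl = TRUE), "tab",
         ifelse(grepl("^\\d+\\.[ ]+", x, perl = TRUE), "spaces", "none"))
}

# Count how many paragraphs fall into each category.
table(classify_numbering(paragraphs))
```

Running the same kind of check on each file, split into paragraphs, would show which files mix the styles.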

The issues below might arise from the patterns above; either way, I ran into several challenges while cleaning the text.

# Read in the data
raw <- readRDS("UNGDC.rds")  # I compiled all .txt files into a single data frame.
processed <- raw

# Remove paragraph numbering
processed$text <- gsub("\\d+\\.\\t", "", raw$text, perl = TRUE)

I tried to strip the numbering by matching one or more digits followed by a dot and a tab. However, this didn't remove the paragraph indices.
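
One possible explanation, given the second inconsistency above, is that some indices are followed by spaces rather than a tab, so a pattern that requires `\t` never matches them. The sketch below is an assumption about the cause, not a confirmed fix: it accepts any run of spaces or tabs after the dot and anchors the match to the start of a line. The helper name `strip_numbering` and the sample text are hypothetical.

```r
# Hypothetical helper: strip a leading "<digits>." index followed by
# spaces and/or tabs, at the start of every line of each string.
strip_numbering <- function(x) {
  # (?m) lets ^ match after each newline, not only at the string start.
  gsub("(?m)^[ \\t]*\\d+\\.[ \\t]+", "", x, perl = TRUE)
}

# Made-up example mixing the three cases described in this issue.
sample_text <- paste(
  "Opening paragraph without a number.",
  "2.\tTab-separated paragraph.",
  "16.   Space-separated paragraph.",
  sep = "\n"
)

cleaned <- strip_numbering(sample_text)
cat(cleaned)
```

Anchoring to the line start avoids deleting digits that appear mid-sentence, but a paragraph that genuinely begins with a number and a dot (a year ending a carried-over sentence, say) would also be stripped, so the result is worth spot-checking against the original files.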

Jihyeonbae changed the title from "Corpus cleaning" to "Inconsistency in paragraph numbering" on Jan 9, 2024