How do you clean your dataset? #1

Open
kodxana opened this issue Feb 24, 2021 · 4 comments

kodxana commented Feb 24, 2021

After cleaning, I'm getting something like this:

User1#6840
n/m

 User2#6840
got it working

 User3#6840
anyone else in same boat as i was

h4nkyn commented May 19, 2021

If you're using Python, loop through the lines and call line.lstrip() on each one to remove the leading whitespace.
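
A minimal sketch of that suggestion, assuming the cleaned dataset is already loaded as one string. The function name strip_leading_whitespace and the sample text are made up for illustration.

```python
def strip_leading_whitespace(text: str) -> str:
    """Remove leading whitespace from every line of the given text."""
    # str.lstrip() with no arguments drops leading spaces and tabs.
    return "\n".join(line.lstrip() for line in text.splitlines())


# Hypothetical sample shaped like the output quoted in the question.
sample = "User1#6840\nn/m\n\n User2#6840\ngot it working\n"
print(strip_leading_whitespace(sample))
```

To clean a whole file, read it with open(...).read(), pass the text through the function, and write the result back out.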

shreesha345 commented

Can you send some code for cleaning it with Python?


ghost commented Nov 13, 2021

Hey, I had this issue as well, so I made a script that does this.

#!/bin/sh
# Description: Scrub txt data downloaded by DiscordChatExporter for use in yourAI
# Usage: discrub <input file> <output file>

# Remove 'Guild' message up top
tail -n +6 "$1" |
	# Delete unnecessary data, bad users, ^M characters, urls and magnet links and format code blocks better
	awk '/{Embed}/ || /{Attachments}/ || /{Reactions}/ || /Joined the server./ || /Pinned a message/ || /Dad Bot/ || /NotSoBot/ { found1 = 1 ; next } /\[..-.*\]/ { found1 = 0 } ! found1 { gsub(/\r/, "") ; if (!/\[..-.*\]/) { gsub(/http[s]?:\/\/[^[:space:]]*/, "") ; gsub(/magnet:\?xt=urn:btih:.*/, "") } sub(/```/, "|||") ; print }' |
	# Remove empty messages
	awk 'found1 { if (/\[..-.*\]/ || /^$/) { prev = $0 ; next } found1 = 0 ; print prev ; print ; next } /\[..-.*\]/ { found1 = 1 ; prev = $0 ; next } { print }' |
	# Change two empty lines to one empty lines after each message, remove timestamp in message, and add ':' after usernames
	awk 'found2 { found2 = 0 ; if (/^$/) { next } } found1 { if (/^$/) { found2 = 1 } } /\[..-.*\]/ { found1 = 1 ; sub(/\[..-.*\] /, "") ; print $0":" ; next } { print }' > "$2"

It only works on Unix systems, but I hope it helps.
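
For anyone who is not on a Unix system, here is a rough Python sketch of the same steps. It is not a line-for-line port of the script above: it assumes message headers look like "[24-Feb-21 10:15 PM] User#1234" (which is what the script's timestamp pattern appears to target), it assumes the export starts with a 5-line header, and it drops blank lines inside a message instead of collapsing them. The names clean_export, HEADER, SKIP_MARKERS, BAD_USERS and URL are made up for this example.

```python
import re

# Assumed header format: "[24-Feb-21 10:15 PM] User#1234"
HEADER = re.compile(r"^\[..-[^\]]*\] (?P<author>.+)$")
# Lines that end the useful part of a message (same list as the shell script).
SKIP_MARKERS = ("{Embed}", "{Attachments}", "{Reactions}",
                "Joined the server.", "Pinned a message")
# Users whose messages are dropped entirely (same bots as the shell script).
BAD_USERS = ("Dad Bot", "NotSoBot")
URL = re.compile(r"https?://\S+|magnet:\?xt=urn:btih:\S+")


def clean_export(text: str, header_lines: int = 5) -> str:
    """Turn a DiscordChatExporter-style txt export into 'User:' blocks."""
    messages = []   # list of (author, [body lines])
    current = None  # message being collected, or None while skipping
    for raw in text.split("\n")[header_lines:]:
        line = raw.strip()  # also removes stray carriage returns
        match = HEADER.match(line)
        if match:
            author = match.group("author")
            if any(bad in author for bad in BAD_USERS):
                current = None  # skip everything this user wrote
            else:
                current = (author, [])
                messages.append(current)
            continue
        if current is None:
            continue
        if any(marker in line for marker in SKIP_MARKERS):
            current = None  # ignore the rest of this message
            continue
        line = URL.sub("", line).replace("```", "|||").strip()
        if line:
            current[1].append(line)
    # Messages with no remaining text are left out.
    blocks = [f"{author}:\n" + "\n".join(body)
              for author, body in messages if body]
    return "\n\n".join(blocks) + "\n" if blocks else ""
```

Usage would be something like reading the export with open("export.txt", encoding="utf-8").read(), passing it to clean_export, and writing the returned string to the output file. The file name is only a placeholder.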

wisplite commented

For some reason, on my system (Debian 11), that script just created a blank text file.
