How do you clean your dataset? #1

kodxana · 2021-02-24T10:41:52Z

After cleaning, my dataset looks something like this:

User1#6840
n/m

 User2#6840
got it working

 User3#6840
anyone else in same boat as i was

h4nkyn · 2021-05-19T02:46:11Z

If you're using Python, loop through the lines and call line.lstrip() on each one to remove the leading whitespace.
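
A minimal sketch of that approach, in case it helps. The function name strip_leading_whitespace and the sample text are made up for illustration; the sample just mimics the indented usernames from the first post.

```python
def strip_leading_whitespace(text: str) -> str:
    """Return text with leading whitespace removed from every line."""
    # str.lstrip() with no arguments drops leading spaces and tabs;
    # blank lines stay blank, so message separators are preserved.
    return "\n".join(line.lstrip() for line in text.splitlines())


# Illustrative sample only, modelled on the output shown above.
sample = "User1#6840\nn/m\n\n User2#6840\ngot it working"
cleaned = strip_leading_whitespace(sample)
print(cleaned)
```

To clean a whole file, you could read it into a string, pass it through the function, and write the result back out.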

shreesha345 · 2021-07-01T05:47:43Z

Can you share some Python code for cleaning it?

ghost · 2021-11-13T13:53:15Z

Hey, I had this issue as well, so I made a script that does this.

#!/bin/sh
# Description: Scrub txt data downloaded by DiscordChatExporter for use in yourAI
# Usage: discrub <input file> <output file>

# Remove 'Guild' message up top
tail -n +6 "$1" |
	# Delete unnecessary data, bad users, ^M characters, urls and magnet links and format code blocks better
	awk '/{Embed}/ || /{Attachments}/ || /{Reactions}/ || /Joined the server./ || /Pinned a message/ || /Dad Bot/ || /NotSoBot/ { found1 = 1 ; next } /\[..-.*\]/ { found1 = 0 } ! found1 { gsub(/\r/, "") ; if (!/\[..-.*\]/) { gsub(/http[s]?:\/\/[^[:space:]]*/, "") ; gsub(/magnet:\?xt=urn:btih:.*/, "") } sub(/```/, "|||") ; print }' |
	# Remove empty messages
	awk 'found1 { if (/\[..-.*\]/ || /^$/) { prev = $0 ; next } found1 = 0 ; print prev ; print ; next } /\[..-.*\]/ { found1 = 1 ; prev = $0 ; next } { print }' |
	# Change two empty lines to one empty line after each message, remove timestamp in message, and add ':' after usernames
	awk 'found2 { found2 = 0 ; if (/^$/) { next } } found1 { if (/^$/) { found2 = 1 } } /\[..-.*\]/ { found1 = 1 ; sub(/\[..-.*\] /, "") ; print $0":" ; next } { print }' > "$2"

It only works on Unix systems, but I hope it helps.

wisplite · 2021-12-24T04:42:22Z

For some reason, on my system (Debian 11), that script just created a blank text file.
