Extracting & Cleaning Twitter Data

It seems like I spend most of my life extracting and cleaning Twitter data. Tweets are always full of garbage that causes errors when you process them.

Part I: A few useful data extraction snippets…

Before dumping, it is useful to remove all uses of your intended CSV separator, as well as line breaks, from your database values (here TwitterData is the table and text is the tweet text column, but the same may also apply to userscreenname, location, etc.)

UPDATE TwitterData SET text=REPLACE(text, '|', '') WHERE text like '%|%';

UPDATE TwitterData SET text=REPLACE(text, char(10), '') WHERE text LIKE '%'||char(10)||'%';

UPDATE TwitterData SET text=REPLACE(text, char(13), '') WHERE text LIKE '%'||char(13)||'%';

(Note: SQLite string literals do not interpret backslash escapes such as '\n', so newlines and carriage returns have to be addressed via char(10) and char(13).)

Dump monthly data from SQLite to CSV and import it into a new table. This is useful when tweets are stored in SQLite and the DB file has become too large to manipulate comfortably (TIP: keeping them in JSON and manipulating them from there is much easier if you can!). It also lets you clean up the data while it is in CSV form, before importing it back into a DB (see the second part of this post).

sudo sqlite3 DBNAME.sqlite
sqlite> .mode csv
sqlite> .separator |
sqlite> .output OUTFILE.csv
sqlite> select created, geolocLat, geolocLong, text from TwitterData where created like '02/__/2014 __:__:__';
sqlite> .output stdout
sqlite> .quit

You may also need to transform the date (COSMOS expects a specific date format). You can do this at the DB query stage; the query below swaps the first two date fields, turning mm/dd/yyyy hh:mm:ss into dd/mm/yyyy hh:mm:ss (where created is the stored timestamp)

select substr(created,4,2)||'/'||substr(created,1,2)||'/'||substr(created,7,4)||' '||substr(created,12,8), geolocLat, geolocLong, text from TwitterData

Import CSV into a new table

Note: you may also want to declare column types here, for example:

create table TwitterData (created TEXT, geolocLat REAL, geolocLong REAL, text TEXT);

(geolocLat needs to be declared REAL if you want to perform > or < queries on it)

sudo sqlite3 feb14.db
sqlite> create table TwitterData (created, geolocLat, geolocLong, text);
sqlite> .mode csv
sqlite> .separator |
sqlite> .import OUTFILE.csv TwitterData

Select distinct dates within a month from a database of timestamped data (assumes column named ‘created’ with format ‘mm/dd/yyyy hh:mm:ss’)

sqlite> select distinct substr(created,0,11) from TwitterData where created like '06/__/2014 __:__:__';

Part II: A few useful cleaning snippets…

A general note: I use INFILE > OUTFILE so I can check the output without modifying the input in case the command is wrong. Use ‘sed -i’ to make the changes in-place.

Print first 10 lines of a file to inspect it

sed -n "1,10p" FILENAME | cat -n

Remove non-ASCII characters from a file

UBUNTU

sudo LANG=C sed -i 's/[\d128-\d255]//g' FILENAME

OSX (BSD sed does not support the \d decimal escapes used above, so use tr to strip the high bytes instead)

sudo LC_ALL=C tr -d '\200-\377' < INFILE > OUTFILE

Remove line breaks in the middle of a record (any line that does not start with a double quote is joined onto the previous line)

sudo awk '{printf "%s%s",(/^"/&&NR>1)?RS:"",$0}' INFILE > OUTFILE
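The behaviour of this one-liner is easiest to see on a toy input (the sample lines below are invented): the awk prints a newline (RS) only before lines that start with a double quote, so any fragment created by an embedded line break gets glued back onto its record.

```shell
# Record 2 ("broken") is a fragment of record 1, split by an embedded newline.
printf '"a|b\nbroken\n"c|d\n' | awk '{printf "%s%s",(/^"/&&NR>1)?RS:"",$0}'
# output:
# "a|bbroken
# "c|d
```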

Remove line breaks before any line that does not start with a date (e.g. 02/01/2014 12:01:34), joining it onto the previous line

sudo awk '{printf "%s%s",(/^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]/&&NR>1)?RS:"",$0}' INFILE > OUTFILE

Delete a range of lines in a file (lines 2360-2361 in this case)

sed "2360,2361d" INFILE > OUTFILE

Delete lines shorter than n characters (n=10 in this case)

sudo sed -r '/^.{,10}$/d' INFILE > OUTFILE
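Note that `-r` and the open-ended interval `{,10}` are GNU sed spellings; a more portable equivalent uses `-E` with an explicit lower bound. A quick check on invented sample lines:

```shell
# Lines of 10 characters or fewer are deleted; longer lines survive.
printf 'short\nthis line is long enough\n' | sed -E '/^.{0,10}$/d'
# output: this line is long enough
```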

Delete quotation marks from file

UBUNTU

sudo sed -i 's/\"//g' FILENAME

OSX

sudo LC_ALL=C sed -i "" 's/\"//g' FILENAME

Delete the 4th instance of |

sed 's/|//4' INFILE > OUTFILE

Delete the 4th and all subsequent instances of | (i.e. keep only the first three separators)

sudo sed -i 's/|//g4' INFILE
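The difference between the two flags shows up on a made-up line with five pipes (the combined `g4` flag, meaning the 4th match and every match after it, is a GNU sed extension):

```shell
echo 'a|b|c|d|e|f' | sed 's/|//4'     # deletes only the 4th pipe
# → a|b|c|de|f
echo 'a|b|c|d|e|f' | sed 's/|//g4'    # deletes the 4th pipe and all after it
# → a|b|c|def
```

This is handy for four-column CSV data: the first three pipes are the real separators, and any later pipes are stray characters inside the tweet text.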
Grep lines not starting with specific dates (the negative lookahead requires a PCRE-capable grep, i.e. grep -P)

grep -P '^(?!24/04/2015 |25/04/2015 |26/04/2015 |27/04/2015 |28/04/2015 |29/04/2015 |30/04/2015 ).*' INFILE > OUTFILE
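If a PCRE-capable grep is not available, inverting a plain extended regex gives the same effect (toy input invented for illustration):

```shell
# keep only lines that do NOT start with one of the listed dates
printf '24/04/2015 12:00:00|a\n01/05/2015 09:00:00|b\n' \
  | grep -vE '^(24/04/2015 |25/04/2015 |26/04/2015 )'
# → 01/05/2015 09:00:00|b
```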
Part III – Full script for cleaning monthly data (this won’t work for everyone of course)

First run the SQLite dump to CSV, then these commands…
sudo sed -i 's/\"//g' OUTFILE.csv

sudo awk '{printf "%s%s",(/^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]/&&NR>1)?RS:"",$0}' OUTFILE.csv > CLEANFILE.csv

sudo LANG=C sed -i 's/[\d128-\d255]//g' CLEANFILE.csv

sudo sed -i 's/|//g4' CLEANFILE.csv

Then run the SQLite import from CSV (CLEANFILE.csv)
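For anyone who wants to try the Part III sequence end to end, here is a minimal sketch on a tiny invented dump (GNU sed assumed; dump.csv and clean.csv are hypothetical filenames, and the sample record is made up):

```shell
cd "$(mktemp -d)"    # scratch directory

# Invented dump: a quoted timestamp, a record split across two lines,
# a non-ASCII character (é) and a stray 5th pipe inside the tweet text.
printf '"02/01/2014 12:01:34"|51.5|-0.1|hello\nworld caf\303\251 a|b\n' > dump.csv

sed -i 's/"//g' dump.csv    # 1. strip quotation marks
awk '{printf "%s%s",(/^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]/&&NR>1)?RS:"",$0}' dump.csv > clean.csv    # 2. join broken lines
LANG=C sed -i 's/[\d128-\d255]//g' clean.csv    # 3. drop non-ASCII bytes
sed -i 's/|//g4' clean.csv    # 4. drop stray pipes after the 3rd separator

cat clean.csv
# → 02/01/2014 12:01:34|51.5|-0.1|helloworld caf ab
```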