Malware Samples

I’m always on the lookout for good malware datasets, so I decided to keep this log of online repositories.

VirusTotal have an API and allow you to refine your search based on a number of factors, such as the number of AV scanners that have marked the sample as malicious or trusted, protocols (e.g. http), network activity etc.

VirusShare have a bucketload of samples and have very kindly curated some useful sample sets (scroll to bottom of the Torrent page) for Android, Linux, 64-bit Win and Ransomware

Maltrieve is not something I have used, but it looks like an interesting crawler for malware across a number of sites.

Corrupt SQLite Databases

Nightmare. Once they get over ~2GB and you transfer them over the wire, they always seem to become corrupt. Here’s a fix: dump the contents to SQL and rebuild the database from the dump.

sqlite3 ../../Data/Crime/DBNAME.db
sqlite> .mode insert
sqlite> .output DBNAME_export.sql
sqlite> .dump
sqlite> .exit

mv DBNAME.db DBNAME.db.original

sqlite3 DBNAME.db < DBNAME_export.sql
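The same dump-and-reload can be scripted. This is a minimal Python sketch rather than the method used above: it uses the standard library's sqlite3 module, and the function name rebuild_database is hypothetical. It assumes the damaged file is still readable enough for the dump to complete (on a badly corrupted file iterdump() may raise an error part-way through) and that the target file does not exist yet.

```python
import sqlite3


def rebuild_database(damaged_path, rebuilt_path):
    """Dump every statement from one SQLite file and replay it into a new file.

    Returns the result of PRAGMA integrity_check on the rebuilt file
    ("ok" when SQLite finds no problems).
    """
    source = sqlite3.connect(damaged_path)
    target = sqlite3.connect(rebuilt_path)
    try:
        # iterdump() yields the same kind of SQL text that ".dump" writes out.
        script = "\n".join(source.iterdump())
        target.executescript(script)
        target.commit()
        return target.execute("PRAGMA integrity_check").fetchone()[0]
    finally:
        source.close()
        target.close()
```

Calling rebuild_database("DBNAME.db.original", "DBNAME.db") would mirror the manual steps, with the integrity check as a quick sanity test afterwards.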

Extracting & Cleaning Twitter Data

It seems like I spend most of my life extracting and cleaning Twitter data. Tweets are always full of garbage that cause errors when processing them.

Part I: A few useful data extraction snippets…

Before dumping, it is useful to remove every occurrence of your intended CSV separator, and all line breaks, from your database values (TwitterData is the table and text is the column here, but the same may apply to userscreenname, location etc.)

UPDATE TwitterData SET text=REPLACE(text, '|', '') WHERE text like '%|%';

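-- SQLite does not expand '\n' or '\r' inside string literals, so use char(10) (line feed) and char(13) (carriage return)

UPDATE TwitterData SET text=REPLACE(text, char(10), '') WHERE text LIKE '%' || char(10) || '%';

UPDATE TwitterData SET text=REPLACE(text, char(13), '') WHERE text LIKE '%' || char(13) || '%';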
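If you would rather run this clean-up from a script, here is a rough Python sketch using the standard sqlite3 module. The function name strip_separators is made up for illustration. Since SQLite string literals do not process backslash escapes, the sketch uses char(10) and char(13) for the line feed and carriage return.

```python
import sqlite3


def strip_separators(conn, table="TwitterData", column="text"):
    """Remove '|', line feeds and carriage returns from one text column."""
    # Table and column names cannot be bound as parameters, so they are
    # interpolated here; only pass trusted names.
    for unwanted in ("'|'", "char(10)", "char(13)"):
        conn.execute(
            f"UPDATE {table} SET {column} = REPLACE({column}, {unwanted}, '') "
            f"WHERE instr({column}, {unwanted}) > 0"
        )
    conn.commit()
```

Pass a different column name (userscreenname, location) to clean the other fields in the same way.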

Dump monthly data from SQLite to CSV and import into new table (this is useful when tweets are in SQLite format and have become too voluminous to manipulate due to DB file size – TIP: keeping them in JSON and manipulating them from there is much easier if you can!). This also allows you to clean up the data while in CSV form, before importing back into a DB (see second part of this post).

sudo sqlite3 DBNAME.sqlite
sqlite> .mode csv
sqlite> .separator |
sqlite> .out OUTFILE.csv
sqlite> select created, geolocLat, geolocLong, text from TwitterData where created like '02/__/2014 __:__:__';
sqlite> .output stdout
sqlite> .quit

You may also need to transform the date (COSMOS has a specific date format). You can do this at the DB query stage (where created is the stored timestamp)

select substr(created,4,2)||'/'||substr(created,1,2)||'/'||substr(created,7,4)||' '||substr(created,12,8), geolocLat, geolocLong, text from TwitterData
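The same month/day swap can be done outside the database. A small Python sketch, assuming the stored values are in 'mm/dd/yyyy hh:mm:ss' form as in the query above (the function name is hypothetical). One side benefit is that strptime raises a ValueError on anything that is not a valid timestamp, so garbage rows show up early.

```python
from datetime import datetime


def us_to_uk_timestamp(created):
    """Convert 'mm/dd/yyyy hh:mm:ss' to 'dd/mm/yyyy hh:mm:ss'."""
    parsed = datetime.strptime(created, "%m/%d/%Y %H:%M:%S")
    return parsed.strftime("%d/%m/%Y %H:%M:%S")
```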

Import CSV into a new table

Note: you may also want to declare column types here (e.g. geolocLat needs to be declared REAL if you want to perform > or < queries)

sudo sqlite3 feb14.db
sqlite> create table TwitterData (created, geolocLat, geolocLong, text);
sqlite> .mode csv
sqlite> .separator |
sqlite> .import OUTFILE.csv TwitterData
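As a sketch of the typed-column note above, this Python version creates the table with REAL latitude and longitude columns before loading the pipe-separated file. It uses only the standard csv and sqlite3 modules; the function name import_pipe_csv is an assumption for illustration, and rows without exactly four fields are skipped rather than repaired.

```python
import csv
import sqlite3


def import_pipe_csv(csv_path, db_path):
    """Load a '|'-separated file into a TwitterData table with typed columns."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS TwitterData "
        "(created TEXT, geolocLat REAL, geolocLong REAL, text TEXT)"
    )
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="|")
        # Skip rows that do not have exactly four fields rather than guessing.
        rows = [row for row in reader if len(row) == 4]
    conn.executemany("INSERT INTO TwitterData VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn
```

With the REAL declaration in place, a query such as geolocLat > 50 compares numbers rather than strings.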

Select distinct dates within a month from a database of timestamped data (assumes column named ‘created’ with format ‘mm/dd/yyyy hh:mm:ss’)

sqlite> select distinct substr(created,1,10) from TwitterData where created like '06/__/2014 __:__:__';

Part II: A few useful cleaning snippets…

General note: I redirect INFILE > OUTFILE so that I can check the output without modifying the input in case the command is incorrect. Use ‘sed -i’ to edit the file in place instead

Print first 10 lines of a file to inspect it

sed -n "1,10p" FILENAME | cat -n

Remove non-ASCII characters from a file

On Linux (GNU sed):

sudo LANG=C sed -i 's/[\d128-\d255]//g' FILENAME

On macOS (BSD sed, where -i takes an explicit, here empty, backup suffix):

sudo LC_ALL=C sed -i "" 's/[\d128-\d255]//g' FILENAME
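If the sed variants misbehave on your platform, a portable alternative is to do the stripping in Python. This is only a sketch with a hypothetical function name; it works on decoded text, so every character outside 7-bit ASCII is dropped whole.

```python
def strip_non_ascii(text):
    """Drop every character outside the 7-bit ASCII range."""
    return text.encode("ascii", errors="ignore").decode("ascii")
```

Apply it line by line while copying INFILE to OUTFILE, in keeping with the check-before-overwriting habit above.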

Remove line breaks in the middle of a line

sudo awk '{printf "%s%s",(/^"/&&NR>1)?RS:"",$0}' INFILE > OUTFILE

Join every line that does not start with a date (e.g. 02/01/2014 12:01:34) onto the line before it

sudo awk '{printf "%s%s",(/^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]/&&NR>1)?RS:"",$0}' INFILE > OUTFILE
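For anyone who finds the awk one-liner hard to read, the same join-on-date logic can be sketched in Python. The names here are hypothetical, and the behaviour is intended to match the awk version: a line that starts with a timestamp begins a new row, and anything else is appended to the previous row with nothing in between.

```python
import re

DATE_PREFIX = re.compile(r"[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}")


def join_continuation_lines(lines):
    """Merge any line that does not start with a timestamp into the row before it."""
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if DATE_PREFIX.match(line) or not rows:
            rows.append(line)
        else:
            # Same as the awk one-liner: append with nothing in between.
            rows[-1] += line
    return rows
```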

Delete a range of lines in a file (lines 2360-2361 in this case)

sed "2360,2361d" INFILE > OUTFILE

Delete lines of n characters or fewer (n=10 in this case)

sudo sed -r '/^.{,10}$/d' INFILE > OUTFILE

Delete quotation marks from file

On Linux (GNU sed):

sudo sed -i 's/\"//g' FILENAME

On macOS (BSD sed):

sudo LC_ALL=C sed -i "" 's/\"//g' FILENAME

Delete 4th instance of | on each line

sed 's/|//4' INFILE > OUTFILE

Delete the 4th and all later instances of | on each line (GNU sed)

sudo sed -i 's/|//g4' INFILE
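
The "keep the first three separators, drop the rest" idea can be sketched in Python as well, which is handy if your sed does not accept a number combined with g. The function name and defaults are assumptions for illustration. With four columns per row, any separator after the third must belong to the tweet text.

```python
def drop_extra_separators(line, separator="|", keep=3):
    """Keep the first `keep` separators and delete any later ones."""
    parts = line.split(separator, keep)
    if len(parts) <= keep:
        return line  # fewer separators than the limit: nothing to remove
    parts[keep] = parts[keep].replace(separator, "")
    return separator.join(parts)
```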

Grep lines not starting with specific dates (a PCRE pattern using negative lookahead)

^(?!24/04/2015 |25/04/2015 |26/04/2015 |27/04/2015 |28/04/2015 |29/04/2015 |30/04/2015 ).*

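
The negative-lookahead pattern also works with Python's re module. A short sketch follows; the function name is hypothetical and the dates are passed in as a list rather than typed into the pattern by hand.

```python
import re


def lines_outside_dates(lines, dates):
    """Return the lines that do not begin with any of the given dates."""
    # Each alternative is a date followed by a space, as in the pattern above.
    alternatives = "|".join(re.escape(date + " ") for date in dates)
    pattern = re.compile(rf"^(?!{alternatives}).*")
    return [line for line in lines if pattern.match(line)]
```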
Part III – Full script for cleaning monthly data (this won’t work for everyone, of course)

First run the SQLite dump to CSV

Then these commands…

sudo sed -i 's/\"//g' OUTFILE

sudo awk '{printf "%s%s",(/^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]/&&NR>1)?RS:"",$0}' OUTFILE > OUTFILE.tmp && sudo mv OUTFILE.tmp OUTFILE

sudo LANG=C sed -i 's/[\d128-\d255]//g' OUTFILE 

sudo sed -i 's/|//g4' OUTFILE

Then run SQLite import from CSV (OUTFILE)

p-Values – think again

The American Statistical Association (ASA) has released a strong and clear statement on the proper use and interpretation of the p-value. 

This is a timely and important announcement because I regularly read and review scientific research articles that rely heavily on the p-value to support the authors’ hypotheses, on the basis that ‘this must be right because p<0.05…’

“The p-value was never intended to be a substitute for scientific reasoning,” said Ron Wasserstein, the ASA’s executive director. “Well-reasoned statistical arguments contain much more than the value of a single number and whether that number exceeds an arbitrary threshold.”

This is the way it is being used though, for sure.

“Over time it appears the p-value has become a gatekeeper for whether work is publishable, at least in some fields,” said Jessica Utts, ASA president. “This apparent editorial bias leads to the ‘file-drawer effect,’ in which research with statistically significant outcomes are much more likely to get published, while other work that might well be just as important scientifically is never seen in print. It also leads to practices called by such names as ‘p-hacking’ and ‘data dredging’ that emphasize the search for small p-values over other statistical and scientific reasoning.”

Absolutely. This is the problem we now face. If we want to clarify the role of the p-value in our research, we need to educate researchers in the art of scientific reasoning and inference using quantitative methods. Submitting a manuscript that doesn’t make a big deal of the p-value in support of its major claims is a big gamble, so why would we take it? We tick all the boxes to please the reviewers, right? We’re academics after all! This is why the ASA statement is so important. It’s something that can be used to justify the limited use of the p-value in an article, and also a rebuttal reference that can be used when peer reviewing to give a polite reminder that “hey, there are other ways to make your claims stronger and p-values ain’t the best way”.

The statement’s six principles, many of which address misconceptions and misuse of the p-value, are the following:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone (this is used a lot in data science papers!)
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis (yet this is regularly used to support such a claim).
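
Principle 5 is easy to see with a bit of arithmetic. The sketch below is my own illustration, not part of the ASA statement, and the numbers are made up: it computes a two-sided p-value for a one-sample z-test with a known standard deviation, using only the Python standard library. The same tiny effect (one hundredth of a standard deviation) is nowhere near "significant" with 100 observations and overwhelmingly "significant" with a million, although the effect size has not changed at all.

```python
import math


def two_sided_p_value(mean_difference, std_dev, n):
    """Two-sided p-value for a one-sample z-test of H0: true mean difference is 0."""
    z = mean_difference / (std_dev / math.sqrt(n))
    # erfc keeps precision for very small tail probabilities.
    return math.erfc(abs(z) / math.sqrt(2))


# Same hypothetical effect size, very different sample sizes.
p_small_sample = two_sided_p_value(0.01, 1.0, 100)
p_large_sample = two_sided_p_value(0.01, 1.0, 1_000_000)
```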

It is further suggested that researchers should “emphasize estimation over testing such as confidence, credibility, or prediction intervals; Bayesian methods; alternative measures of evidence such as likelihood ratios or Bayes factors; and other approaches such as decision-theoretic modeling and false discovery rates.”

The ASA statement is signed off with the following remark, and let’s hope this reaches the masses…

“What we hope will follow is a broad discussion across the scientific community that leads to a more nuanced approach to interpreting, communicating, and using the results of statistical methods in research.”

This is not a new problem, or a new debate. But the ASA saying it out loud will hopefully make people listen up!

Media Training

Media training today at BBC studios in Llandaff. Very interesting introductory talks from Prof Richard Sambrook of Cardiff School of Journalism, ex-director of BBC News (for some 20+ years) among other experience, and Claire Sanders, director of Communications and Marketing at Cardiff University and member of UEB.

Main takeaway messages were:

  • Pick 3-5 clear take-home messages
  • Understand the audience of the media outlet — tabloid, science, broadsheet, radio are all different and will respond to different stimuli.
  • “Show, don’t tell” – this means that it is much better to present substantial findings, outcomes or recommendations based on tangible evidence, than to talk around a topic without this. Without experimentation or substantive study, the story will not penetrate the hearts and minds of the recipients. For impact, “show, don’t tell”

This afternoon we have been developing key messages based on our own stories. Tomorrow we will film these on TV and record them on radio at BBC studios. This is all being superbly facilitated by Kevin Bentley and Karen Ainley of Mosaic Publicity, both with 20+ years’ experience in BBC TV and Radio.

I used a recent press release based on our research on cybersecurity and social media. Read it here.

Key messages:

  • Cyber criminals are using real world events to post links to Twitter that contain malware
  • At Cardiff, we have trained a machine to recognise the predictive signals that distinguish between malicious and benign URLs using computer activity
  • Most antivirus software uses a fingerprint of malware based on the code it executes, whereas we propose to generate a fingerprint based on the computer activity during code execution – so we can pick up previously unseen malware code
  • This is important to Twitter users because malware infection can lead to increased risk of identity theft or becoming part of a network of machines used to launch further attacks
  • Corporate managers should also be concerned, as users of business IT, or those who bring their own computer to work, can infect corporate networks. The same issues exist and the risk of IPR loss also becomes prevalent



While on the subject of keeping COSMOS happy – it does require specific date formats when importing CSV files. The date column needs to be the first column in the CSV file and the date needs to be in the following format

"dd/mm/yyyy hh:mm:ss"

For example:

"28/04/2015 10:24:37"

None of your American month-first malarkey, thank you very much.

Code Tweaks

I’m working with some CSV text data today. Tweets, actually. People seem to think it’s a good idea to put hard returns into their tweets, which makes my life difficult when importing them from text files into something useful like COSMOS 🙂

Lines usually start with a " character and this snippet will remove any lines that do not, and bring them all onto one line…making a well formatted row in the CSV file and keeping COSMOS (and me) happy.


Find and replace (with single space character) using grep in TextWrangler. Job done.

GW4 Coding

I attended a workshop in Bristol today, organised by James Davenport from Bath University, to kick off a GW4 collaboration on coding pedagogy. GW4 is a collaborative effort between Cardiff, Bath, Bristol and Exeter Universities. Between us (and Cardiff Met) we have 5 different approaches to teaching programming. Between us we teach an array of languages – C, Haskell, Python, Java... the list goes on. We teach these in different years of study, at different levels, in different ways. Add to that the National Software Academy at Cardiff, which is a hands-on applied undergraduate course with a stronger emphasis on software engineering and programming skills than traditional academic degrees.

In essence, there isn’t a common framework or accepted practice for teaching how to develop robust and secure software. This is something we intend to remedy. We will be meeting again next month to plan the way forward.