Thursday, 16 August 2007

STATA commands: hexdump and filefilter

Brendan McConnell Writes:

Hi there computer people,

I’m emailing as I finally figured out a solution to a problem I was having with a dataset, and I thought I’d tell you in case people have the same problem in the future, and I felt like I was banging my head against a brick wall trying to figure it out.

The problem I had was that someone at the DfES messed up a .txt dataset that I was trying to insheet into Stata; instead of the rows being separated by carriage returns, they were separated by tabs (columns were separated by commas) and Stata didn’t like this, as it read it in as a single row. I couldn’t rectify the problem with a “replace all” type command in any word editors as the programs couldn’t load the dataset fully (it about 85 MB, with all the data on a single line). Stat-transfer wouldn’t work either

I finally figured out that I could use the commands hexdump (hexadecimal dump – means nothing to me!) and filefilter from within Stata to alter the dataset before it is opened. It’s so simple, but it worked perfectly for my problem. I’ve attached the relevant bit of my log file for you to see how it works – it just let me replace all tabs with carriage returns, changing the dataset from being 1 line with 87,302,806 characters to 595,377 lines with something like 30 columns. Perfect.

So now I’ve figured it out in Stata, would there have been a better way? Is there a UNIX version of Stat-transfer? Also, is there a command line in the UNIX editor bit (not sure what it’s called) that I could have used to have done the same thing. Irrelevant for this task, but I’d like to know for future reference.

Tanvi Desai Writes:

Hi Brendon,

sounds like you found a very good method.

There are two possible alternatives, but I'm not sure they are an improvement on what you've done. The first method (which I usually use) is to do a find and replace on UNIX...

The other quick and dirty way is to infile into excel first where you can specify your delimiter, and then stat-transfer, but this obviously only works for the smaller datasets because of excel's size limits.

No comments: