Maximum PC

All times are UTC - 8 hours




 Post subject: Perl Search and Replace help
PostPosted: Thu Jul 22, 2004 11:52 am 
Bitchin' Fast 3D Z8000

Joined: Mon Jun 14, 2004 10:34 am
Posts: 827
Not quite sure how to set this up... I use a program that can output CSV, and several of the fields contain commas. That used to not be a problem, as all the fields had quotes around them, so I whipped up a Perl script that would find all instances of "," and replace them with | (the separator another script needs). Now the program has changed and only fields with commas in them have quotes around them, so most fields don't have quotes (and thus the "," replacing won't work).

So, how do I first search only inside the quoted fields and replace/delete the commas there with something else? Once I get rid of the extra commas, I can then replace all the remaining commas with | and my script will work once again.

Thanks!


Post subject:
PostPosted: Thu Jul 22, 2004 1:53 pm
iron colbinator

Joined: Tue May 25, 2004 2:25 pm
Posts: 2761
Location: Washington, the state
Can you give an example of what you have and what you want? Based on your text it's a little fuzzy.

I'm thinking what you are saying is that you could have:

"foo,bar,baz",omg,wtf,lol

and you want:

"foo,bar,baz"|omg|wtf|lol

but you do not want:

"foo|bar|baz"|omg|wtf|lol

Is that right?


Post subject:
PostPosted: Thu Jul 22, 2004 2:09 pm
Bitchin' Fast 3D Z8000

Joined: Mon Jun 14, 2004 10:34 am
Posts: 827
Pretty much, except I'd have all the quotes stripped out in the final version. If I can get to your example, then I can run another search/replace to get rid of the quotes (I know I need to keep them there initially, so I can use them to guide the script to delete the right commas, and not the wrong ones).

I could even keep the commas in the final product, but I still have to identify them in the beginning. I could say, replace all the commas in quotes with some special character(s) like @~@, then replace the remaining commas with |, and then replace the @~@ back with the commas that were there before.... If I can get the first piece of the puzzle figured out, then I can handle it from there (may be the sloppiest code ever, but it'll work).
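
The three-step placeholder idea above could look roughly like this. It is only a sketch in Java (not the Perl script from the original post): the class name `PlaceholderSwap` is made up for illustration, and it assumes the placeholder `@~@` never shows up in the real data and that quotes are never escaped. It also drops the quotes in the same pass instead of in a separate search/replace.

```java
public class PlaceholderSwap {
    // Hypothetical placeholder; assumed never to occur in the real data.
    private static final String PLACEHOLDER = "@~@";

    public static String convert(String line) {
        // Step 1: hide the commas that sit inside quotes, dropping the quotes as we go.
        StringBuilder hidden = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == ',' && inQuotes) {
                hidden.append(PLACEHOLDER);
            } else {
                hidden.append(c);
            }
        }
        // Step 2: every comma still left is a real separator.
        String piped = hidden.toString().replace(",", "|");
        // Step 3: put the protected commas back.
        return piped.replace(PLACEHOLDER, ",");
    }

    public static void main(String[] args) {
        // prints: what|are you, up to|cheese|man
        System.out.println(convert("what,\"are you, up to\",cheese,man"));
    }
}
```

Once step 1 already knows which commas are inside quotes, the placeholder is not strictly needed (the separators could be swapped in the same loop), but this keeps the shape of the plan as described.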


Post subject:
PostPosted: Thu Jul 22, 2004 6:49 pm
iron colbinator

Joined: Tue May 25, 2004 2:25 pm
Posts: 2761
Location: Washington, the state
Well, I hate to do this, but I basically solved your problem, at least I think so. I got so wrapped up in trying to find the best way that I found myself going through the code, and just writing it down. Also, I am using a perl IDE (Komodo) which helped in the debugging of most of the additional cases.

First, I split each line on commas. The problem is that quoted fields can contain commas, so they end up split across several pieces. So, I look for an opening quote, and if I find one, I scan ahead to find the piece that holds the closing quote. Then, I print everything in between the quotes as one field, followed by a |, and move on. Regular unquoted "stuff" just gets printed followed by a | (except at the end of the line, where I don't want to print one, because that would be silly).

I know there are cases it won't cover where the data is malformed and all freaked out. Give it a shot and let me know if it does what you really were looking for. Here is the sample input I used:
input wrote:
"i like cheese",cheese,is,good
"well, maybe",cheese,is,the,best
cheese,tastes,yummy
what,"are you, up to",cheese,man
today,is,"cheese, cheese, day",foo
today,is,"cheese, cheese, day"
whoops,"i, forgot, my, quote
i have,"two quotes",i have,"two quotes",ok

And here is the output:
output wrote:
i like cheese|cheese|is|good
well, maybe|cheese|is|the|best
cheese|tastes|yummy
what|are you, up to|cheese|man
today|is|cheese, cheese, day|foo
today|is|cheese, cheese, day
whoops|i, forgot, my, quote
i have|two quotes|i have|two quotes|ok

Here's the source:
Code:
#!/usr/bin/perl
use strict;
use warnings;

# read CSV from "infile", write pipe-delimited text to "outfile"
open(INFILE,  "<", "infile")  or die "cannot open infile: $!";
open(OUTFILE, ">", "outfile") or die "cannot open outfile: $!";

while(<INFILE>)
{
   my $line = $_;
   chomp($line);

   my @commas = split(/,/,$line);

   for(my $i = 0; $i <= $#commas; $i++)
   {
       if($commas[$i] =~ /"/)
       {
           # sub off the first one so we can find out if it
           # is only one item
           $commas[$i] =~ s/"//;
           my $j = $i;
           while($commas[$j] !~ /"/)
           {
               # just keep lookin
               if($j >= $#commas)
               {
                   print "i could not find a matching quote, this line is borked! ";
                   print "i am going to assume it should have been at the end.\n";
                   $j = $#commas;
                   # the quote has to be escaped inside the string literal
                   $commas[$j] .= "\"";
               }
               else
               {
                   # put a comma back since we removed it
                   $commas[$j] .= ",";

                   $j++;
               }
           }

           # we know i is the start, and j is the end,
           # so everything in between is only one item.
           $commas[$j] =~ s/"//;

           for(my $start = $i; $start <= $j; $start++)
           {
               print OUTFILE $commas[$start];

               if($start == $j && $j != $#commas)
               {
                   print OUTFILE "|";
               }
           }
           # skip i ahead to j, since we already got all that
           $i = $j;
       }
       else
       {
           print OUTFILE $commas[$i];
           print OUTFILE "|" unless $i == $#commas; # don't print one at the end
       }

   }
   print OUTFILE "\n";
}

close(INFILE);
close(OUTFILE);


Post subject:
PostPosted: Thu Jul 22, 2004 11:00 pm
Bitchin' Fast 3D Z8000

Joined: Mon Jun 14, 2004 10:34 am
Posts: 827
Let me just say you are awesome! All I had to do was fix the line wrapping the forum did, and it worked! One small question...I hope I'm not bothering you....

I've found out some of these fields have newlines in them. Specifically it has a notes section that some people use to put a lot of different info in (it's the Notes section from a PayPal payment, so it's anything from thanks, to a whole address, or nothing at all). This of course is screwing everything up. Now please don't spend a lot of time on this, but if it's something simple can you give me a hint?

Each line should start with an ebay auction number, so that helps. In the meantime, I'll see if I can't beg the programmer who made this to strip these newlines himself. He wouldn't budge on the csv format, but maybe he will on this (he did add the notes field for me, so maybe I'll get lucky).


Post subject:
PostPosted: Fri Jul 23, 2004 12:04 am
Bitchin' Fast 3D Z8000

Joined: Mon Jun 14, 2004 10:34 am
Posts: 827
He's considering taking the newlines out, so the newlines screwing up the fields may not be a problem for much longer. This is awesome!


Post subject:
PostPosted: Fri Jul 23, 2004 10:59 am
iron colbinator

Joined: Tue May 25, 2004 2:25 pm
Posts: 2761
Location: Washington, the state
josetann wrote:
He's considering taking the newlines out, so the newlines screwing up the fields may not be a problem for much longer. This is awesome!


Basically, if the newlines aren't taken out at the source, we would need some other way to tell where each record begins. By default, "while(<INFILE>)" reads up to each "\n" (Perl's input record separator, $/), but you can change that to something else and re-work the code a little bit if newlines continue to be a problem. :)


Post subject:
PostPosted: Fri Jul 23, 2004 12:49 pm
Bitchin' Fast 3D Z8000*

Joined: Tue Jun 29, 2004 11:32 pm
Posts: 2555
Location: Somewhere between compilation and linking
Can I play?!?!?! I wanna play?!!?! Please!!!!

Code:
public class Jose {
    public static String parse(String s) {

        StringBuffer sb = new StringBuffer(s);
        char oldDelim = ',';
        String newDelim = "|";

        boolean inQuotes = false;  //true while we are between a pair of quotes

        for (int i = 0; i < sb.length(); i++) {
            if (sb.charAt(i) == '"') {
                inQuotes = !inQuotes;
                sb.deleteCharAt(i--);       //notice the --
                continue;                   //re-check whatever slid into this slot;
                                            //never look back at the previous char
            }

            if (!inQuotes && sb.charAt(i) == oldDelim)
                sb.replace(i,i+1,newDelim);
        }

        return sb.toString();
    }

    public static void main(String[] args) {

        String[] input = {
            "\"i like cheese\",cheese,is,good",
            "\"well, maybe\",cheese,is,the,best",
            "cheese,tastes,yummy",
            "what,\"are you, up to\",cheese,man",
            "today,is,\"cheese, cheese, day\",foo",
            "today,is,\"cheese, cheese, day\"",
            "whoops,\"i, forgot, my, quote",
            "i have,\"two quotes\",i have,\"two quotes\",ok",
        };

        String[] expected = {
            "i like cheese|cheese|is|good",
            "well, maybe|cheese|is|the|best",
            "cheese|tastes|yummy",
            "what|are you, up to|cheese|man",
            "today|is|cheese, cheese, day|foo",
            "today|is|cheese, cheese, day",
            "whoops|i, forgot, my, quote",
            "i have|two quotes|i have|two quotes|ok"
        };

        for (int i = 0; i < input.length; i++) {
            String s = Jose.parse(input[i]);
            if (s.equals(expected[i]))
                System.out.println("passed test #" + i);
            else {
                System.out.println("\nFailed test #" + i);
                System.out.println("expected: " + expected[i]);
                System.out.println("returned: " + s + "\n");
            }
        }

        System.out.println("done");
        System.exit(0);
    }
}


Like Colby's solution (I think; I can't stand reading Perl, it gives me nightmares, and she didn't test for it), I assume that the user won't escape a quote character, i.e. \" and " are treated the same.

Silly damn Perl scripts and all their ^%/d/s*.#@!/s!(^.[a-z]*&)


Post subject:
PostPosted: Fri Jul 23, 2004 12:53 pm
Bitchin' Fast 3D Z8000

Joined: Mon Jun 14, 2004 10:34 am
Posts: 827
Wow, nice! Um...what is it?


Post subject:
PostPosted: Fri Jul 23, 2004 12:55 pm
Java Junkie

Joined: Mon Jun 14, 2004 10:23 am
Posts: 24152
Location: Granite Heaven
That, my friend, looks like java. :)


Post subject:
PostPosted: Fri Jul 23, 2004 1:17 pm
Bitchin' Fast 3D Z8000*

Joined: Tue Jun 29, 2004 11:32 pm
Posts: 2555
Location: Somewhere between compilation and linking
josetann wrote:
Wow, nice! Um...what is it?

Jipstyle knows me (and Java) too well. I'm still trying to find the magic regex that will just split the string correctly without having to loop over it. But I can't seem to separate...

String s = ""my brain, from the cheese", to quote Colby".;


Post subject:
PostPosted: Fri Jul 23, 2004 1:24 pm
iron colbinator

Joined: Tue May 25, 2004 2:25 pm
Posts: 2761
Location: Washington, the state
Gadget wrote:
Can I play?!?!?! I wanna play?!!?! Please!!!!

Like Colby's solution (I think - I can't stand reading PERL, give me nightmares, but she didn't test for it) - I assume that the user won't escape a quote character, ie. \" and " are treated the same.

Silly damn perl scripts and all their ^%/d/s*.#@!/s!(^.[a-z]*&)


Someone on IRC always refers to it as foul language. I do a lot of stuff in perl :) I did not test for \" either, though, assuming that this person isn't a total psycho. I also did not bother with a single quote.

The only things you didn't do were to read the input from a file or write the output to a file ;)

I couldn't think of a magic regex to take care of it; I got stuck on the case where the quoted fields contain commas. There might be a way, but I'm not sure that it would be worth it. If this works fast enough, it'll do the job all right. A massively complex regex would be harder for me to tweak, since it looks even more like gibberish ;)
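
For what it's worth, one commonly used regex trick for this is a lookahead that only accepts a comma when an even number of quotes follows it on the line. A rough Java sketch, with the class name `RegexSwap` made up for illustration and the same assumption as the rest of the thread that quotes are never escaped:

```java
public class RegexSwap {
    // A comma is a separator only if an even number of quote characters
    // follows it on the line, i.e. it is not inside a quoted field.
    private static final String OUTSIDE_QUOTES =
        ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    public static String convert(String line) {
        // Swap the separators first, then strip the quotes.
        return line.replaceAll(OUTSIDE_QUOTES, "|").replace("\"", "");
    }

    public static void main(String[] args) {
        // prints: today|is|cheese, cheese, day|foo
        System.out.println(convert("today,is,\"cheese, cheese, day\",foo"));
    }
}
```

Two caveats. The lookahead rescans the rest of the line for every comma, so it gets slow on very long lines. And it does not behave like the looping versions on a line with a missing closing quote: the "whoops" test line comes out as whoops,i| forgot| my| quote, because every comma after the stray quote sees an even number (zero) of quotes ahead of it.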


Post subject:
PostPosted: Fri Jul 23, 2004 2:52 pm
Bitchin' Fast 3D Z8000*

Joined: Tue Jun 29, 2004 11:32 pm
Posts: 2555
Location: Somewhere between compilation and linking
colby wrote:
Gadget wrote:
Can I play?!?!?! I wanna play?!!?! Please!!!!

Like Colby's solution (I think - I can't stand reading PERL, give me nightmares, but she didn't test for it) - I assume that the user won't escape a quote character, ie. \" and " are treated the same.

Silly damn perl scripts and all their ^%/d/s*.#@!/s!(^.[a-z]*&)


Someone on IRC always refers to it as foul language. I do a lot of stuff in perl :) I did not test for \" either, though, assuming that this person isn't a total psycho. I also did not bother with a single quote.

The person on IRC is right. :)

A few other test cases that Jose should think about....

Should...
,cheese, = |cheese| or |cheese or cheese| or just cheese?
jack,,cheese = jack||cheese or jack|cheese?
"" = "" or do something else? (empty string, !(empty quote"))
""" = " or what?
, = | or ""?
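
One possible set of answers, sketched in Java: keep every empty field, and treat a doubled quote inside a quoted field as a literal quote. That is just one convention (a common one for CSV), not necessarily what Jose's program does, and the class name `FieldSplitter` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldSplitter {
    // Splits one CSV line into fields, keeping empty fields.
    public static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    field.append('"');   // "" inside quotes is a literal quote
                    i++;                 // skip the second quote of the pair
                } else {
                    inQuotes = !inQuotes;
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());    // last field, possibly empty
        return fields;
    }

    public static void main(String[] args) {
        String[] samples = { ",cheese,", "jack,,cheese", "\"\"", "," };
        for (String s : samples) {
            System.out.println(s + " -> " + String.join("|", split(s)));
        }
    }
}
```

Under this convention ,cheese, becomes |cheese|, jack,,cheese becomes jack||cheese, "" becomes a single empty field, and a lone , becomes |. Whether the downstream script wants the empty fields kept is something only Jose can answer.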


colby wrote:
The only things you didn't do were to read the input from a file or write the output to a file ;)

Actually, I was leaving that part for you. ;)

Hmm, that actually complicates things a bit more.....
"great
,big",cheese

...should be great,big|cheese
...but would come out great|big,cheese
on my version, if I were to read and then parse the input a single line at a time. Anyways.... time to play with some opengl. Where's that damn JDJ at... :)
* assuming the user can split the quotes onto separate lines, and depending on whether/how the newlines are significant.
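
One way around the split-record problem is to glue physical lines together until the quotes balance, and only then hand the record to the parser. A minimal Java sketch, assuming quotes are never escaped and that the newline inside a field can simply be dropped; the class and method names (`RecordJoiner`, `joinRecords`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecordJoiner {
    // An odd number of quote characters means a quoted field is still open.
    private static boolean quotesBalanced(CharSequence s) {
        int quotes = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '"') quotes++;
        }
        return quotes % 2 == 0;
    }

    // Merges physical lines into logical records.
    public static List<String> joinRecords(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder record = new StringBuilder();
        for (String line : lines) {
            record.append(line);             // the embedded newline is dropped
            if (quotesBalanced(record)) {
                records.add(record.toString());
                record.setLength(0);
            }
        }
        if (record.length() > 0) {
            records.add(record.toString());  // unterminated quote at end of input
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("\"great", ",big\",cheese", "cheese,tastes,yummy");
        for (String r : joinRecords(lines)) {
            System.out.println(r);
        }
        // prints:
        // "great,big",cheese
        // cheese,tastes,yummy
    }
}
```

The weak spot is a single stray quote, which would swallow every following line into one record. Since each real line is supposed to start with an eBay auction number, checking for that at the start of a line would be a reasonable sanity check before appending.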


Post subject:
PostPosted: Fri Jul 23, 2004 2:53 pm
Bitchin' Fast 3D Z8000*

Joined: Tue Jun 29, 2004 11:32 pm
Posts: 2555
Location: Somewhere between compilation and linking
Watch Manta show up with 5 lines of Python and 1/10th the experience and make us look silly.... :)


Post subject:
PostPosted: Fri Jul 23, 2004 3:19 pm
iron colbinator

Joined: Tue May 25, 2004 2:25 pm
Posts: 2761
Location: Washington, the state
Gadget wrote:
Watch Manta show up with 5 lines of Python and 1/10th the experience and make us look silly.... :)


If you could figure out a regex to do it, it really could be short.


Post subject:
PostPosted: Fri Jul 23, 2004 5:24 pm
Bitchin' Fast 3D Z8000

Joined: Mon Jun 14, 2004 4:04 pm
Posts: 985
Location: Earth
Gadget wrote:
Watch Manta show up with 5 lines of Python and 1/10th the experience and make us look silly.... :)


From what I remember of my very limited Python programming, it has really good text parsing mechanisms. In fact, I once wrote a small program that would parse text from a file and resolve the hostnames to IPs. I wrote it in C# and spent a good chunk of my time writing my own regex; a buddy of mine did it in Python in a shorter amount of time.


Post subject:
PostPosted: Sat Jul 24, 2004 12:02 am
Bitchin' Fast 3D Z8000*

Joined: Tue Jun 29, 2004 11:32 pm
Posts: 2555
Location: Somewhere between compilation and linking
DJSPIN80 wrote:
Gadget wrote:
Watch Manta show up with 5 lines of Python and 1/10th the experience and make us look silly.... :)


From what I remember of my very limited Python programming, it has really good text parsing mechanisms. In fact, I once wrote a small program that would parse text from a file and resolve the hostnames to IPs. I wrote it in C# and spent a good chunk of my time writing my own regex; a buddy of mine did it in Python in a shorter amount of time.

Python does have some very nice features. And it is purty code. :)

Does C# also have a Pattern and Matcher class for the regex stuff? I'm just starting to work with them some. On a somewhat related note, the StringTokenizer in Java is supposed to be a 'legacy class' now, and String.split() is the preferred method (I have no idea when this happened either). I used the StringTokenizer class to do most tokenization, but String.split() is actually pretty nice. 90% of the time, I just need something basic....

Code:
// with String.split()
String[] result = stringObject.split(regex);

// with StringTokenizer
StringTokenizer st = new StringTokenizer(stringObject,"delimiter");
String[] result = new String[st.countTokens()];
for (int i = 0; i < result.length /*or st.hasMoreTokens()*/; i++)
    result[i] = st.nextToken();

...are roughly equivalent, as long as there are no empty fields (StringTokenizer skips them; String.split() keeps the ones that aren't at the end). Taking a regex instead of a string delimiter is a powerful feature, though. The only String.split() bummer that I've come across so far is that one of the StringTokenizer constructors takes an additional boolean for returning the delimiters as tokens, which is missing in the String.split() world. Oh well, I guess everything won't be reduced to a single loc.
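
A small Java sketch of where the two part ways, using the jack,,cheese case from earlier in the thread. The class and method names are made up for illustration; the calls themselves are the standard `String.split` and `java.util.StringTokenizer` ones.

```java
import java.util.StringTokenizer;

public class SplitVsTokenizer {
    public static int splitCount(String s) {
        return s.split(",").length;
    }

    public static int tokenizerCount(String s) {
        return new StringTokenizer(s, ",").countTokens();
    }

    public static int tokenizerCountWithDelims(String s) {
        // third argument: hand back the delimiters as tokens too
        return new StringTokenizer(s, ",", true).countTokens();
    }

    public static void main(String[] args) {
        String s = "jack,,cheese";
        System.out.println(splitCount(s));               // 3: jack, "", cheese
        System.out.println(tokenizerCount(s));           // 2: the empty field is skipped
        System.out.println(tokenizerCountWithDelims(s)); // 4: jack , , cheese

        // split() silently drops trailing empty strings unless given a negative limit
        System.out.println("a,b,,".split(",").length);     // 2
        System.out.println("a,b,,".split(",", -1).length); // 4
    }
}
```

So for CSV-ish data where empty fields matter, split(",", -1) is probably the safer default of the two.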

On a very unrelated note, the Java vs C++ war is still raging...
http://sys-con.com/story/?storyid=45250&DE=1
...and most of the comments in the forums are ridiculous (note that it is spelled correctly).

Quote:
The simple truth of the matter is that java cannot be faster than C++. C++ compiles straight to machine code, while Java compiles to bytecode, then is compiled to machine code or interpreted as bytecode by the VM.

There is always a new generation unwilling to accept even the possibility of something new, better, or faster....
Java is much slower than C++ - 2000's
CPP is much slower than C - 1990's
C is much slower than assembly - 1970's/1980's
(sure, in the 70's compilers were much less efficient than they are today)


What'ya mean I shouldn't use goto's - jmp this buddy
A side what effect?
hey, it is easier to maintain functions than objects. Dumbass.
WTH do you mean that method cross-cuts multiple objects? My code is like so modular dude.

A "Dr. Sameko" even said that Java could never be faster than C because the JVM is written in C. I guess a C compiler written in Slow, the world's slowest programming language, has to produce much slower assembly code than the same compiler written in Fast, the world's fastest programming language. After all, the Slow version of the compiler is written in a 'slower' language. I'll probably end up working with this....

/me shoots himself

edit: there are actually a number of good posts towards the bottom. "larry", not to be confused with this Larry, makes several excellent points - and there are links to a good C++ moderated user group thread - and some USC research. Good thing I missed the critical organs. :)


Post subject:
PostPosted: Sat Jul 24, 2004 9:19 am
Bitchin' Fast 3D Z8000

Joined: Mon Jun 14, 2004 4:04 pm
Posts: 985
Location: Earth
Gadget wrote:
Does C# also have a Pattern and Matcher class for the regex stuff? I'm just starting to work with them some.


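Yes, and it's very similar to Java: the Regex and Match classes in System.Text.RegularExpressions play the part of Pattern and Matcher. String.Split() only takes literal delimiters; Regex.Split() is what we use for the regex stuff.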

Gadget wrote:
On a very unrelated note, the Java vs C++ war is still raging...
http://sys-con.com/story/?storyid=45250&DE=1
...and most of the comments in the forums are ridiculous (note that it is spelled correctly).


LOL!!! It never ceases to crack me up, these folks who think that Java is slower than C++. In some apps, maybe, but not all. I was reading an article some time ago that compared C#, C++, and Java in terms of speed. Their benchmark was an empty, nested for loop with each level counting to 10 billion:

for(long i = 0; i < 10000000000L; i++)
for(long j = 0; j < 10000000000L; j++)
{ }

From the benchies, Java did it in less than a second; apparently the VM was smart enough to skip the empty for loops, whereas C# and C++ practically iterated through the entire thing.


I used to be a C++ die-hard until I started dabbling more and more into application programming, started reading articles regarding benchmarking languages, and just plainly started talking with other programmers. Java's a powerful language that has libraries that are far beyond the scope of C++ libraries. The only real saving grace of C++ is STL, IMHO.

C#'s a powerful language as well, but its bytecode is, IMHO, not as cross-platform as Java's, which is putting me in a tight situation since I'm thinking of writing a CMS but can't decide on C# (ASP.NET) or Java (JSP).


Post subject:
PostPosted: Sat Jul 24, 2004 9:36 am
Bitchin' Fast 3D Z8000

Joined: Mon Jun 14, 2004 4:04 pm
Posts: 985
Location: Earth
BTW, I read through some of the feedback and gosh, are people moronic:

"Since the Java code isn't really deleting memory ( as it's GC'd ) then none of the C++ tests should try to delete memory"

Obviously a moron. The delete keyword doesn't scrub the memory; it runs the object's destructor and hands the block back to the heap for reuse. Hence, it's doing the same job as Garbage Collection, just by hand and right away, whereas in GC objects are managed smarter.

" None of these tests look at resource usage... how much CPU time was used? How much memory was used?"

What a nincompoop. Hasn't he heard about algorithms? Run times? Argh!!!

Dr. Valentin Samko doesn't know what he's talking about. std::string is obviously slower than char*, so the author actually gave C++ an advantage here. Also, heapsort.cpp, while it is using C code, is perfectly legit in a C++ benchmark. IIRC, C is (mostly) a subset of C++, thus you can maintain your legacy C code. To be honest, I'd rather write my own data structures in C.

I didn't make it through the rest of the article, it's just giving me a headache. :shock:


Post subject:
PostPosted: Sat Jul 24, 2004 10:06 am
iron colbinator

Joined: Tue May 25, 2004 2:25 pm
Posts: 2761
Location: Washington, the state
Our software is all Java, and I have to say, a lot of "making java faster" isn't just making your code faster (better algorithms, less memory consumption, etc), it's knowing how the JVM works.

I've done testing of our Java code with Sun, Blackdown, IBM, and BEA's JRockit JVMs. They each seem to have a slightly different "focus" and different strengths and weaknesses. I find JRockit the most intriguing (and so far the best for server-side applications) because of its concurrent GC, adaptive runtime, and in-line test/debugging tools (basically, there is a profiler built into the JVM).

I think most of the people who extensively criticise Java:
1) Have not really written "real" code in Java
2) Do not understand how Java and the JVM work
3) Suck at OOP

8)

