Modify Perl script to output new line every 20 words
I'm using an open-source Perl script to create a text corpus from the English-language Wikipedia dump. The plain text has been extracted, but various punctuation marks and the like still need to be removed. However, the script's output is essentially a 7.2 GiB text file containing a single line. For my purposes, I want to alter the script so that it inserts a newline character every 20 words.
So far, I've tried this:
    $wordCount = 0;
    while () {
        $wordCount++;
        # text processing regex commands here
        # Remove ellipses
        s/\.\.\./ /g;
        # Remove dashes surrounded by spaces (e.g. phrase - phrase)
        s/\s-+\s/ /g;
        # Remove dashes between words with no spaces (e.g. word--word)
        s/([A-Za-z0-9])\-\-([A-Za-z0-9])/$1 $2/g;
        # Remove dash at a word end (e.g. three- to five-year)
        s/(\w)-\s/$1 /g;
        # Remove some punctuation
        s/([\"\?,;:%???!()\[\]{}<>_\.])/ /g;
        # Remove trailing space
        s/ $//;
        # Remove double single-quotes
        s/'' / /g;
        s/ ''/ /g;
        # Replace accented e with normal e for consistency with the CMU pronunciation dictionary
        s/é/e/g;
        # Remove single quotes used as quotation marks (e.g. some 'phrase in quotes')
        s/\s'([\w\s]+[\w])'\s/ $1 /g;
        # Remove double spaces
        s/  / /g;
        chomp($_);
        if ($wordCount == 20) {
            print uc($_) . "\n";
            $wordCount = 0;
        }
        print uc($_) . " ";
    }
    print "\n";
However, this doesn't seem to work, as the raw output has only newlines scattered around arbitrarily. I'd like to have the text formatted so it will fit on a typical 1200px wide monitor without word wrapping.
A sample of the input text from the file:
The Concise Oxford Dictionary of Politics. Proponents of anarchism (known as "anarchists") advocate stateless societies as the only moral form of social organization. There are many types and traditions of anarchism, not all of which are mutually exclusive. Anarchism as a social movement has regularly endured fluctuations in popularity. The term anarchism derives from the Greek ???????, anarchos, meaning "without rulers", its use as a synonym is still common outside the United States. The earliest anarchist themes can be found in the 6th century BC, among the works of Taoist philosopher Laozi, and in later centuries by Zhuangzi and Bao Jingyan. The term "anarchist" first entered the English language in 1642, during the English Civil War, as a term of abuse, used by Royalists against their Roundhead opponents. By the time of the French Revolution some, such as the Enrags, began to use the term positively, in opposition to Jacobin centralisation of power, seeing "revolutionary government" as oxymoronic. By the turn of the 19th century, the English word "anarchism" had lost its initial negative connotation. Modern anarchism sprang from the secular or religious thought of the Enlightenment, particularly Jean-Jacques Rousseau's arguments for the moral centrality of freedom. Anarchism", Encarta Online Encyclopedia 2006 (UK version). From this climate William Godwin developed what many consider the first expression of modern anarchist thought. Godwin was, according to Peter Kropotkin, "the first to formulate the political and economical conceptions of anarchism, even though he did not give that name to the idea s developed in his work", while Godwin attached his anarchist ideas to an early Edmund Burke. The anarcho-communist Joseph Djacque was the first person to describe himself as "libertarian". 
Unlike Proudhon, he argued that, "it is not the product of his or her labor that the worker has a right to, but to the satisfaction of his or her needs, whatever may be t heir nature. Jesus is sometimes considered the first anarchist in the Christian anarchist tradition. Georges Lechartier wrote that "The true founder of anarchy was Jesus Christ and . In Europe, harsh reaction followed the revolutions of 1848, during which ten countries had experienced brief or long-term social upheaval as groups carried out nationalis t uprisings. After most of these attempts at systematic change ended in failure, conservative elements took advantage of the divided groups of socialists, anarchists, liberals, and na tionalists, to prevent further revolt. Blanquists, Philadelphes, English trade unionists, socialists and social democrats. Due to its links to active workers' movements, the International became a significant organization. Karl Marx became a leading figure in the International and a member of its General Council. Proudhon's followers, the mutualists, opposed Marx's state socialism, advocating political abstentionism and small property holdings. In 1868, following their unsuccessful participation in the League of Peace and Freedom (LPF), Russian revolutionary Mikhail Bakunin and his collectivist anarchist associa tes joined the First International (which had decided not to get involved with the LPF). At first, the collectivists worked with the Marxists to push the First International in a more revolutionary socialist direction. Subsequently, the International became polarised into two camps, with Marx and Bakunin as their respective figureheads. In 1872, the conflict climaxed with a final split between the two groups at the Hague Congress, where Bakunin and James Guillaume were expelled from the International and its headquarters were transferred to New York. In response, the federalist sections formed their own International at the St. 
Imier Congress, adopting a revolutionary anarchist program. Black Rose Books 2005) ISBN 1-55164-251-4.
There's 7-something gigs worth of text in the file. So using a list or other data structure might be a bit of overkill for these requirements.
What is needed in order to fit my requirements?
Answer by JRFerguson for Modify Perl script to output new line every 20 words
Consider using something like Text::Wrap or Text::Autoformat.
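A minimal sketch of the Text::Wrap route. Note that Text::Wrap breaks lines by column width rather than by word count, so this aims at the "fit on the monitor" goal instead of the literal 20 words. The helper name `wrap_to_width` and the 20-column demo width are assumptions for illustration only, and `wrap` works on strings held in memory, so a 7 GiB single-line file would presumably have to be fed to it in pieces.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical helper (not part of Text::Wrap): wrap $text so that
# every output line is shorter than $width columns.
sub wrap_to_width {
    my ($text, $width) = @_;
    local $Text::Wrap::columns = $width;   # lines come out at most $width - 1 chars long
    return wrap('', '', $text);            # no indent on the first or on later lines
}

# Demo with a deliberately narrow width.
print wrap_to_width("the quick brown fox jumps over the lazy dog", 20), "\n";
```

For real output a much larger width would be used, chosen to match the font and window size.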
Answer by Wolfger for Modify Perl script to output new line every 20 words
Without knowing more details about this problem, I'd suggest a brute force solution: slurp the entire entry, split to an array based on " ", foreach the array and print "\n" after every 20 elements.
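Roughly, that brute-force idea could look like the sketch below. The helper name `break_every_n_words` is made up for this example, and the demo slurps from an in-memory string rather than the real file. Since everything is held in memory at once, this is only an illustration of the approach and is unlikely to be practical for a 7 GiB input.

```perl
use strict;
use warnings;

# Hypothetical helper: return $text re-flowed with $per_line words per line.
sub break_every_n_words {
    my ($text, $per_line) = @_;
    my @words = split ' ', $text;   # split ' ' ignores leading space and splits on whitespace runs
    my $out = '';
    # Take $per_line words off the front until none are left.
    while (my @chunk = splice @words, 0, $per_line) {
        $out .= join(' ', @chunk) . "\n";
    }
    return $out;
}

# Demo: slurp a small in-memory "file" and break it every 3 words.
my $sample = "one two three four five six seven";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
my $text = do { local $/; <$in> };  # undefined $/ reads the whole handle at once
close $in;
print break_every_n_words($text, 3);
```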
Answer by wespiserA for Modify Perl script to output new line every 20 words
    open my $in,  '<', $inFileName  or die "Cannot open $inFileName: $!";
    open my $out, '>', $outFileName or die "Cannot open $outFileName: $!";
    my $wordCount = 0;
    while (defined(my $line = <$in>)) {
        $line =~ s/\n//g;    # remove newline character
        # split the words into an array (could use '\W+' instead of ' ')
        my @words = split ' ', $line;
        foreach my $word (@words) {
            $wordCount++;
            print $out uc($word);
            if ($wordCount == 20) {
                # 20th word: end the line and restart the count
                $wordCount = 0;
                print $out "\n";
            }
            else {
                print $out " ";
            }
        }    # end of foreach word in line
    }    # end of file while loop
    close $in;
    close $out;
Answer by KingZoingo for Modify Perl script to output new line every 20 words
True to Perl, there are various ways to solve this, but one (perverse?!) way is to read the file byte by byte instead of reading it line by line or slurping the whole thing in. It's rather brute-force, but it works. Essentially you are trading memory use for disk access.
    #!/usr/bin/perl -w
    use strict;

    open(IN, "in.txt") or die;
    my $rc = 1;
    my $wc = 0;
    my $new;
    while ($rc != 0) {
        # Read a byte - not safe for Unicode or other double-byte environments!
        $rc = read IN, $new, 1, 0;
        # We're only interested if the byte isn't punctuation (POSIX character class).
        if ($new !~ m/[[:punct:]]/) {
            # word boundary?
            if ($new =~ m/ /) {
                $wc++;
                if ($wc % 20 == 0) {
                    print "\n";    # 20th word, time for a new line.
                }
            }
            print $new;
        }
        # move on to the next byte
        seek IN, 0, 1;
    }
    close(IN);
Answer by Eric Strom for Modify Perl script to output new line every 20 words
First, set perl's input record separator to something frequent and useful, like a space:
$/ = ' ';
then loop over the input word by word:
while (<>) {
trim the word:
s/^\s+|\s+$//g;
skip it if it was all space:
length or next;
do any other transforms you need
and then add it to a stack, splitting any internal tabs or other space-like characters:
push @words, split /\s+/;
next, check to see if you have 20 words, and if so, print them:
print join(' ' => splice @words, 0, 20), "\n" while @words >= 20;
}
then print anything remaining:
print "@words\n" if @words;
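Put together, the steps above might look something like this sketch. The `reflow` wrapper and its parameters are names invented for this example. It also uses `split ' '` to fold the trim, skip-empty and split steps into a single call, which is a small departure from the snippets above. The demo runs on an in-memory string; for the real corpus one would presumably pass `\*STDIN` and `\*STDOUT` (or opened file handles) and 20 instead.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the steps above: copy $in to $out with
# $per_line words per line, reading one space-delimited record at a time
# so the whole input never has to sit in memory.
sub reflow {
    my ($in, $out, $per_line) = @_;
    local $/ = ' ';                       # a "line" now ends at each space
    my @words;
    while (my $chunk = <$in>) {
        # (any other per-word transforms would go here)
        push @words, split ' ', $chunk;   # trims, drops empties, splits on tabs/newlines
        print {$out} join(' ', splice @words, 0, $per_line), "\n"
            while @words >= $per_line;
    }
    print {$out} "@words\n" if @words;    # whatever is left over
}

# Demo on a small in-memory string, two words per line.
my $sample = "alpha beta\tgamma  delta\nepsilon";
open my $in, '<', \$sample or die "Cannot open in-memory input: $!";
reflow($in, \*STDOUT, 2);
close $in;
```

Because only the pending words (fewer than one line's worth plus one record) are buffered, memory use stays small regardless of the file size.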