String concatenation

last update: 5/21/19 10:10 PM

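A couple of days ago I had to do a large amount of string concatenation to generate HTML files, so I looked into how to build strings quickly and ran some tests.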

python

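In Python, the first approach that comes to mind is c=c+a, which is very intuitive. It turns out Python also offers several other ways to build up a string: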

c=c+a

c+=a

"{}{}".format(c,a)


c=[] c+=[a] "".join(c)
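
As a quick sanity check, the four approaches listed above all build the same string. The sketch below only checks equality, not speed. The sample data in `parts` and the function names are made up for illustration and are not part of the original test.

```python
# Hypothetical sample data, chosen only for illustration.
parts = ["<tr>", "<td>", "cell", "</td>", "</tr>"]

def concat_plus(pieces):
    c = ""
    for a in pieces:
        c = c + a                    # plain concatenation
    return c

def concat_inplace(pieces):
    c = ""
    for a in pieces:
        c += a                       # augmented assignment
    return c

def concat_format(pieces):
    c = ""
    for a in pieces:
        c = "{}{}".format(c, a)      # str.format with two placeholders
    return c

def concat_join(pieces):
    c = []
    for a in pieces:
        c += [a]                     # collect the pieces first
    return "".join(c)                # build the final string once

results = [f(parts) for f in (concat_plus, concat_inplace, concat_format, concat_join)]
print(results[0])  # <tr><td>cell</td></tr>
```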

Test code:

# coding: utf-8

import time

count = 20000

beg = time.time()
a = "this is a test sting a"
b = "this is another test string 测试测试测试ing"
res = ''

for i in range(count):
    if i%2 == 0:
        res = res + a + ":"+str(i)
    else:
        res = res + b+ ":"+str(i)
print("time cost +: %s",time.time()-beg)

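# NOTE: res is not reset here, so this loop keeps appending to the string built by the previous test.
beg = time.time()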
for i in range(count):
    if i%2 == 0:
        res += a + ":"+str(i)
    else:
        res += b+ ":"+str(i)
print("time cost +=: %s",time.time()-beg)

# NOTE: res is still not reset, so format() starts from the string built by the two tests above.
beg = time.time()
for i in range(count):
    if i%2 == 0:
        res = "{} {} {} {}".format(res,a,":",str(i)) 
    else:
        res = "{} {} {} {}".format(res,b,":",str(i)) 
print("time cost format: %s",time.time()-beg)

res = []
beg = time.time()
for i in range(count):
    if i%2 == 0:
        res+=[a,":",str(i)]
    else:
        res+=[b,":",str(i)]
result = " ".join(res)
print("time cost join list: %s",time.time()-beg)


res = 1
beg = time.time()
for i in range(count):
    if i%2 == 0:
        res = res + i
    else:
        res = res + i/2 
print("time cost + num : %s",time.time()-beg)

res = 1
beg = time.time()
for i in range(count):
    if i%2 == 0:
        res += i
    else:
        res += i/2 
print("time cost += num: %s",time.time()-beg)

Execution times:

$ python test.py
('time cost +: %s', 3.4623711109161377)
('time cost +=: %s', 0.015604972839355469)
('time cost format: %s', 3.1990790367126465)
('time cost join list: %s', 0.013344049453735352)
('time cost + num : %s', 0.005431175231933594)
('time cost += num: %s', 0.005048036575317383)

$ python3 test.py
time cost +: %s 6.251662492752075
time cost +=: %s 0.02083301544189453
time cost format: %s 52.249574184417725
time cost join list: %s 0.020740747451782227
time cost + num : %s 0.005892515182495117
time cost += num: %s 0.005952358245849609
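
In the script above, `res` is carried over from one test to the next, so the later loops start from an already long string. A possible fairer variant is sketched below: each approach starts from an empty accumulator and is timed with `timeit`. The function names, the smaller default `count`, and the `"{}{}:{}"` format string (without the spaces used above, so that all four results are identical) are my own assumptions, and the timings listed above were not produced by this version.

```python
import timeit

A = "this is a test sting a"
B = "this is another test string 测试测试测试ing"

def build_plus(count):
    res = ""
    for i in range(count):
        s = A if i % 2 == 0 else B
        res = res + s + ":" + str(i)
    return res

def build_iadd(count):
    res = ""
    for i in range(count):
        s = A if i % 2 == 0 else B
        res += s + ":" + str(i)
    return res

def build_format(count):
    res = ""
    for i in range(count):
        s = A if i % 2 == 0 else B
        res = "{}{}:{}".format(res, s, i)
    return res

def build_join(count):
    parts = []
    for i in range(count):
        s = A if i % 2 == 0 else B
        parts += [s, ":", str(i)]
    return "".join(parts)

def bench(count=2000, repeat=3):
    """Return the best-of-`repeat` wall time in seconds for each approach."""
    timings = {}
    for fn in (build_plus, build_iadd, build_format, build_join):
        # number=1: every timed call builds the whole string from scratch.
        timings[fn.__name__] = min(
            timeit.repeat(lambda: fn(count), number=1, repeat=repeat)
        )
    return timings

if __name__ == "__main__":
    for name, seconds in bench(count=5000).items():
        print("%s: %.6f s" % (name, seconds))
```

Taking the minimum over a few repeats is a common way to reduce noise from other processes. The absolute numbers will of course depend on the machine and the interpreter version.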

The same test in Go:

package main

import (
	"fmt"
	"strings"
	"time"
)

var count = 50000

var a = "this is a test"
var b = "this is another test"
var c string // shared accumulator; note that it is not reset between testAdd and testPlus

func testAdd() {
	t := time.Now()
	for i := 0; i < count; i++ {
		c = c + a + b
	}
	fmt.Println("test add:", time.Now().Sub(t))
}

func testPlus() {
	t := time.Now()
	for i := 0; i < count; i++ {
		c += a + b
	}
	fmt.Println("test plus:", time.Now().Sub(t))
}

func testArray() {
	d := []string{}
	t := time.Now()
	for i := 0; i < count; i++ {
		d = append(d, a, b)
	}
	strings.Join(d, "")
	fmt.Println("test array:", time.Now().Sub(t))

}

func main() {
	testAdd()
	testPlus()
	testArray()
}

Test results:

$ go run testgo/string_plus.go
test add: 6.464752503s
test plus: 19.349181857s
test array: 10.107554ms

$ go run testgo/string_plus.go   # with count = 20000
test add: 1.046215762s
test plus: 3.233030314s
test array: 4.82766ms

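Taken together, these tests show that when many strings have to be concatenated, collecting the pieces in an array (a Go slice or a Python list) and calling join() once at the end is far more efficient.
The reason is that the array only stores references to the pieces: appending adds a new reference to existing string data and does not copy the characters. In Go, append grows the capacity geometrically when the slice is full (roughly doubling it for small slices) and allocates new memory for the slice, but what gets moved are the references, not the underlying string data. The characters themselves are copied once, in join(). With + concatenation, by contrast, each step produces a new string, which means allocating new memory and copying everything accumulated so far, and that takes a lot of time.
The tests also suggest that Python 3 did not handle these concatenations any better than Python 2.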
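
The gap can be made concrete by counting copied characters. The sketch below is a simplified model, not a measurement: it assumes every + builds a brand-new string and copies the whole result, while join copies each piece exactly once. Real runtimes may optimize some cases, and the helper names are illustrative.

```python
def chars_copied_by_plus(piece_lengths):
    """Characters written if every `res = res + piece` builds a new string."""
    copied = 0
    current = 0
    for n in piece_lengths:
        current += n        # length of the newly built string
        copied += current   # the whole new string is written once
    return copied

def chars_copied_by_join(piece_lengths):
    """Characters written if the pieces are collected and joined once."""
    return sum(piece_lengths)

lengths = [10] * 20000                   # 20000 pieces of 10 characters each
print(chars_copied_by_plus(lengths))     # 2000100000
print(chars_copied_by_join(lengths))     # 200000
```

Under this model the work for + grows with the square of the number of pieces, while join grows linearly.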

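In addition: in Python, += is a reasonable choice, and in this test it was second only to the list approach (CPython can often extend the left-hand string in place for +=). In Go, c += x and c = c + x are equivalent, so the gap between testAdd and testPlus above should not be read as += being slower: c is a package-level variable that is never reset, so testPlus starts from the long string that testAdd left behind, which matches the roughly 3x difference. Either way, avoid repeated concatenation in Go loops and use strings.Join instead. Comparing like with like at count = 20000 (the + and join cases), Go was also clearly faster than Python in these runs, although the two programs do not concatenate exactly the same strings.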
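
In the Go program above, `c` is shared by testAdd and testPlus and is never reset. A rough way to see what that does to the timings is to model the copy cost in Python. The sketch assumes each concatenation copies the whole result, and `copy_cost` is a hypothetical helper, not part of the original tests.

```python
def copy_cost(start_len, piece_len, count):
    """Characters copied by `count` concatenations starting at `start_len`,
    assuming each concatenation copies the whole result."""
    total = 0
    length = start_len
    for _ in range(count):
        length += piece_len
        total += length
    return total

count = 50000
piece = len("this is a test") + len("this is another test")  # a + b in the Go program

first = copy_cost(0, piece, count)                # testAdd starts from an empty c
second = copy_cost(piece * count, piece, count)   # testPlus starts from testAdd's result
ratio = second / first
print(ratio)  # close to 3
```

The measured ratios (19.349 s / 6.465 s and 3.233 s / 1.046 s) are both close to 3, which is consistent with this model rather than with any difference between + and += themselves.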

Someone else's test code (Python 2):

from cStringIO import StringIO
import timing, commands, os
from sys import argv

def method1():
	out_str = ''
	for num in xrange(loop_count):
		out_str += `num`
	ps_stats()
	return out_str

def method2():
	from UserString import MutableString
	out_str = MutableString()
	for num in xrange(loop_count):
		out_str += `num`
	ps_stats()
	return out_str

def method3():
	from array import array
	char_array = array('c')
	for num in xrange(loop_count):
		char_array.fromstring(`num`)
	ps_stats()
	return char_array.tostring()

def method4():
	str_list = []
	for num in xrange(loop_count):
		str_list.append(`num`)
	out_str = ''.join(str_list)
	ps_stats()
	return out_str

def method5():
	file_str = StringIO()
	for num in xrange(loop_count):
		file_str.write(`num`)
	out_str = file_str.getvalue()
	ps_stats()
	return out_str

def method6():
	out_str = ''.join([`num` for num in xrange(loop_count)])
	ps_stats()
	return out_str


def ps_stats():
	global process_size
	ps = commands.getoutput('ps -up ' + `pid`)
	process_size = ps.split()[15]

def call_method(num):
	global process_size
	timing.start()
	z = eval('method' + str(num))()
	timing.finish()
	print "method", num
	print "time", float(timing.micro()) / 1000000
	print "output size ", len(z) / 1024, "kb"
	print "process size", process_size, "kb"
	print
	
loop_count = 100000
pid = os.getpid()

if len(argv) == 2:
	call_method(argv[1])
else:
	print "Usage: python stest.py <n>\n" \
		"  where n is the method number to test"

Results: Twenty thousand integers

In the first test 20,000 integers were concatenated into a string 86 kB long.

Method      Concatenations per second    Process size (kB)
Method 1    3770                         2424
Method 2    2230                         2424
Method 3    29,600                       2452
Method 4    83,700                       3028
Method 5    90,900                       2536
Method 6    119,800                      3000

Results: Five hundred thousand integers

Next I tried a run of each method using 500,000 integers concatenated into a string 2,821 kB long. This is a much more serious test and we start to see the size of the Python interpreter process grow to accommodate the data structures used in the computation.

Method      Concatenations per second    Process size (kB)
Method 3    17,100                       8,016
Method 4    74,800                       22,872
Method 5    94,900                       10,480
Method 6    102,100                      22,844

Conclusions

I would use Method 6 in most real programs. It’s fast and it’s easy to understand. It does require that you be able to write a single expression that returns each of the values to append. Sometimes that’s just not convenient to do - for example when there are several different chunks of code that are generating output. In those cases you can pick between Method 4 and Method 5.

Method 4 wins for flexibility. You can use all of the normal slice operations on your list of strings, for insertions, deletions and modifications. The performance for appending is pretty decent.

Method 5 wins out on efficiency. It uses less memory than either of the list methods (4 & 6) and it’s faster than even the list comprehension for very large numbers of operations (more than about 700,000). If you’re doing a lot of string appending cStringIO is the way to go.
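
The quoted script only runs on Python 2 (backticks, `xrange`, `cStringIO`, `print` statements). As a rough Python 3 counterpart of Methods 4 to 6, one might write the sketch below. It assumes `io.StringIO` as the stand-in for `cStringIO` and `repr()` for the backtick syntax, and it takes `loop_count` as a parameter instead of a global. The measurements quoted above do not carry over to this version.

```python
import io

def method4(loop_count):
    """Append to a list, then join once."""
    str_list = []
    for num in range(loop_count):
        str_list.append(repr(num))
    return "".join(str_list)

def method5(loop_count):
    """Write into an in-memory text buffer."""
    file_str = io.StringIO()
    for num in range(loop_count):
        file_str.write(repr(num))
    return file_str.getvalue()

def method6(loop_count):
    """List comprehension passed straight to join."""
    return "".join([repr(num) for num in range(loop_count)])

print(method6(12))  # 01234567891011
```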

woodpenker's blog
